What is the main motivation of this project? The goal is to predict the number of applications for admission to American colleges from 777 observations of 17 explanatory variables. The aim is to fit an optimal model to the data so as to predict the variable named "Apps" as accurately as possible.
What can the output of such a project be used for? Predicting the number of applications for admission to universities can support future educational planning and a better understanding of the variables that affect it, for example:
Which variables have the greatest impact on the number of applications, and whether they can be strengthened in future educational planning.
Which variables have the least impact on the number of applications for admission to universities.
Which variables are ineffective, i.e. independent of "Apps", and whether they can be omitted.
Who might be interested in the results of this project?
Where did the data come from and how was it collected? This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University. The dataset was used in the ASA Statistical Graphics Section’s 1995 Data Analysis Exposition.
What do each of the variables measure?
Is there any ambiguity in the definitions of the data? Yes, there is some ambiguity in the definition or calculation of some variables, such as perc.alumni, Grad.Rate, and S.F.Ratio (the student-to-faculty ratio).
Is there an error in measuring variables or recording data? Yes, for various reasons such as:
What other variables, if any, could help solve the problem? Independent variables such as:
What are the types of the existing variables (categorical or numerical)? Except for one variable named "Private", which is a binary (discrete categorical) variable, the other variables are numerical.
require(moments) #Moments, skewness, kurtosis and related tests
require(corrplot) #Visualization of Correlation Matrix
require(car) #Required for Calculations Related to Regression
require(ggplot2) #Create Elegant Data Visualizations
require(MASS) #Box-Cox Transformations for Linear Models
require(leaps) #Regression Subset Selection
require(glmnet) #Lasso and Elastic-Net Regularized GLM
require(rpart) #Classification and Regression Trees
require(rpart.plot) #Plot Decision Tree
require(randomForest) #Random Forests for Classification and Regression
require(gbm) #Boosting Methods in Regression Problems
require(xgboost) #Boosting Methods in Regression Problems
Statistical Summary
data<-read.csv("/Volumes/USB DISK/project/college.csv",header = TRUE)
class(data)
## [1] "data.frame"
dim(data)
## [1] 777 19
head(data,2)
## College.Name Private Apps Accept Enroll Top10perc Top25perc
## 1 Abilene Christian University Yes 1660 1232 721 23 52
## 2 Adelphi University Yes 2186 1924 512 16 29
## F.Undergrad P.Undergrad Outstate Room.Board Books Personal PhD Terminal
## 1 2885 537 7440 3300 450 2200 70 78
## 2 2683 1227 12280 6450 750 1500 29 30
## S.F.Ratio perc.alumni Expend Grad.Rate
## 1 18.1 12 7041 60
## 2 12.2 16 10527 56
tail(data,2)
## College.Name Private Apps Accept Enroll Top10perc
## 776 Yale University Yes 10705 2453 1317 95
## 777 York College of Pennsylvania Yes 2989 1855 691 28
## Top25perc F.Undergrad P.Undergrad Outstate Room.Board Books Personal PhD
## 776 99 5217 83 19840 6510 630 2115 96
## 777 63 2988 1726 4990 3560 500 1250 75
## Terminal S.F.Ratio perc.alumni Expend Grad.Rate
## 776 96 5.8 49 40386 99
## 777 75 18.1 28 4509 99
College.Name
length(unique(data$College.Name))
## [1] 777
# No duplicate data
Private
class(data$Private)
## [1] "character"
data$Private<-factor(data$Private,levels = c("Yes","No")) #Convert to factor
levels(data$Private)<-c("Private","Not Private") #Rename factor levels
levels(data$Private)
## [1] "Private" "Not Private"
summary(data$Private)
## Private Not Private
## 565 212
##Distribution of Private
table(data$Private, useNA = "ifany")
##
## Private Not Private
## 565 212
# No missing values; private universities outnumber public ones by roughly 2.7 to 1 (565 vs. 212).
barplot(table(data$Private), main = "Distribution of university type", col = "darkblue")
Apps (Dependent variable)
summary(data$Apps)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 81 776 1558 3002 3624 48094
which(data$Apps==48094)
## [1] 484
#The maximum value 48094 occurs only once; it may have been entered incorrectly.
#Check for missing values in Apps:
sum(is.na(data$Apps))
## [1] 0
# No missing values
#Distribution of Apps
#1.histogram of Apps
hist(data$Apps, xlab = "Number of applications received", main = "Histogram of Apps")
#2.boxplot of Apps
boxplot(data$Apps, main = "Boxplot of Apps")
boxplot(Apps ~ Private, data = data, main = "Boxplot of Apps by type of university", xlab = "Type of university")
#3.QQ-plot of Apps
qqnorm(data$Apps, main="qqplot of Apps",pch=20)
qqline(data$Apps,col="red")
#4.Jarque-Bera test (skewness = 0) and Anscombe-Glynn test (kurtosis = 3) for Apps
library(moments)
jarque.test(data$Apps)
##
## Jarque-Bera Normality Test
##
## data: data$Apps
## JB = 24687, p-value < 2.2e-16
## alternative hypothesis: greater
anscombe.test(data$Apps)
##
## Anscombe-Glynn kurtosis test
##
## data: data$Apps
## kurt = 29.595, z = 15.195, p-value < 2.2e-16
## alternative hypothesis: kurtosis is not equal to 3
#Conclusion: reject the normality assumption for the Apps distribution.
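The JB statistic above can be reconstructed directly from sample moments; below is a minimal sketch on simulated data (illustrative only, not the moments package implementation; `jb_stat` is a hypothetical helper name):

```r
# Jarque-Bera statistic from first principles:
# JB = n/6 * (S^2 + (K - 3)^2 / 4), where S is sample skewness and
# K is sample kurtosis; under normality JB ~ chi-squared with 2 df.
jb_stat <- function(x) {
  n <- length(x)
  m <- mean(x)
  s <- sqrt(mean((x - m)^2))      # population-style standard deviation
  S <- mean((x - m)^3) / s^3      # skewness
  K <- mean((x - m)^4) / s^4      # kurtosis
  n / 6 * (S^2 + (K - 3)^2 / 4)
}

set.seed(1)
x_norm <- rnorm(1000)   # roughly symmetric, mesokurtic
x_skew <- rexp(1000)    # strongly right-skewed, like Apps
jb_stat(x_norm)         # small for the normal sample
jb_stat(x_skew)         # very large for the skewed sample
```

A huge JB value like the one observed for Apps (24687) therefore reflects both heavy skew and heavy tails, not a borderline deviation.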
Other continuous independent variables
#Check for missing values:
for (i in c(4:19)) print(c(names(data)[i], NumberofMissingValue = sum(is.na(data[[i]]))))
## NumberofMissingValue
## "Accept" "0"
## NumberofMissingValue
## "Enroll" "0"
## NumberofMissingValue
## "Top10perc" "0"
## NumberofMissingValue
## "Top25perc" "0"
## NumberofMissingValue
## "F.Undergrad" "0"
## NumberofMissingValue
## "P.Undergrad" "0"
## NumberofMissingValue
## "Outstate" "0"
## NumberofMissingValue
## "Room.Board" "0"
## NumberofMissingValue
## "Books" "0"
## NumberofMissingValue
## "Personal" "0"
## NumberofMissingValue
## "PhD" "0"
## NumberofMissingValue
## "Terminal" "0"
## NumberofMissingValue
## "S.F.Ratio" "0"
## NumberofMissingValue
## "perc.alumni" "0"
## NumberofMissingValue
## "Expend" "0"
## NumberofMissingValue
## "Grad.Rate" "0"
# No missing values
#Distribution of the other continuous independent variables
#1.Histogram of the other continuous independent variables
par(mar=c(2,2,2,2))
par(mfrow=c(4,4)) #4 rows and 4 columns
for(i in c(4:19)){
hist(data[,i], main = paste("Histogram of", names(data)[i]), col = "darkblue")}
#2.Boxplot of the other continuous independent variables
for(i in c(4:19)){
boxplot(data[,i], main = paste("Boxplot of", names(data)[i]))
}
#3.QQ-plot of the other continuous independent variables
for(i in c(4:19)){
qqnorm(data[,i],main=paste("qqplot of",names(data)[i]),pch=20)
qqline(data[,i],col="red")}
par(mfrow=c(1,1))
#4.Jarque-Bera test (skewness = 0) for the other continuous independent variables
library(moments)
sapply(data[,4:19],jarque.test)
## Accept Enroll
## statistic 12960.1 3422.191
## p.value 0 0
## alternative "greater" "greater"
## method "Jarque-Bera Normality Test" "Jarque-Bera Normality Test"
## data.name "X[[i]]" "X[[i]]"
## Top10perc Top25perc
## statistic 412.3679 19.12887
## p.value 0 7.018076e-05
## alternative "greater" "greater"
## method "Jarque-Bera Normality Test" "Jarque-Bera Normality Test"
## data.name "X[[i]]" "X[[i]]"
## F.Undergrad P.Undergrad
## statistic 2768.508 100954.3
## p.value 0 0
## alternative "greater" "greater"
## method "Jarque-Bera Normality Test" "Jarque-Bera Normality Test"
## data.name "X[[i]]" "X[[i]]"
## Outstate Room.Board
## statistic 39.13872 30.61429
## p.value 3.170551e-09 2.25005e-07
## alternative "greater" "greater"
## method "Jarque-Bera Normality Test" "Jarque-Bera Normality Test"
## data.name "X[[i]]" "X[[i]]"
## Books Personal
## statistic 27209.39 2010.193
## p.value 0 0
## alternative "greater" "greater"
## method "Jarque-Bera Normality Test" "Jarque-Bera Normality Test"
## data.name "X[[i]]" "X[[i]]"
## PhD Terminal
## statistic 86.03721 87.76365
## p.value 0 0
## alternative "greater" "greater"
## method "Jarque-Bera Normality Test" "Jarque-Bera Normality Test"
## data.name "X[[i]]" "X[[i]]"
## S.F.Ratio perc.alumni
## statistic 265.8505 47.86244
## p.value 0 4.043932e-11
## alternative "greater" "greater"
## method "Jarque-Bera Normality Test" "Jarque-Bera Normality Test"
## data.name "X[[i]]" "X[[i]]"
## Expend Grad.Rate
## statistic 12796.29 3.119798
## p.value 0 0.2101573
## alternative "greater" "greater"
## method "Jarque-Bera Normality Test" "Jarque-Bera Normality Test"
## data.name "X[[i]]" "X[[i]]"
#Reject the normality (skewness) assumption for all the variables except "Grad.Rate".
#Anscombe-Glynn test (kurtosis = 3) for the other continuous independent variables
sapply(data[,4:19],anscombe.test)
## Accept Enroll
## statistic numeric,2 numeric,2
## p.value 9.93411e-46 1.425782e-31
## alternative "kurtosis is not equal to 3" "kurtosis is not equal to 3"
## method "Anscombe-Glynn kurtosis test" "Anscombe-Glynn kurtosis test"
## data.name "X[[i]]" "X[[i]]"
## Top10perc Top25perc
## statistic numeric,2 numeric,2
## p.value 4.357408e-11 6.971441e-06
## alternative "kurtosis is not equal to 3" "kurtosis is not equal to 3"
## method "Anscombe-Glynn kurtosis test" "Anscombe-Glynn kurtosis test"
## data.name "X[[i]]" "X[[i]]"
## F.Undergrad P.Undergrad
## statistic numeric,2 numeric,2
## p.value 4.030631e-29 8.73106e-65
## alternative "kurtosis is not equal to 3" "kurtosis is not equal to 3"
## method "Anscombe-Glynn kurtosis test" "Anscombe-Glynn kurtosis test"
## data.name "X[[i]]" "X[[i]]"
## Outstate Room.Board
## statistic numeric,2 numeric,2
## p.value 0.003368399 0.2678832
## alternative "kurtosis is not equal to 3" "kurtosis is not equal to 3"
## method "Anscombe-Glynn kurtosis test" "Anscombe-Glynn kurtosis test"
## data.name "X[[i]]" "X[[i]]"
## Books Personal
## statistic numeric,2 numeric,2
## p.value 3.53884e-53 9.053454e-28
## alternative "kurtosis is not equal to 3" "kurtosis is not equal to 3"
## method "Anscombe-Glynn kurtosis test" "Anscombe-Glynn kurtosis test"
## data.name "X[[i]]" "X[[i]]"
## PhD Terminal
## statistic numeric,2 numeric,2
## p.value 0.007727027 0.1814422
## alternative "kurtosis is not equal to 3" "kurtosis is not equal to 3"
## method "Anscombe-Glynn kurtosis test" "Anscombe-Glynn kurtosis test"
## data.name "X[[i]]" "X[[i]]"
## S.F.Ratio perc.alumni
## statistic numeric,2 numeric,2
## p.value 1.038581e-12 0.6156338
## alternative "kurtosis is not equal to 3" "kurtosis is not equal to 3"
## method "Anscombe-Glynn kurtosis test" "Anscombe-Glynn kurtosis test"
## data.name "X[[i]]" "X[[i]]"
## Expend Grad.Rate
## statistic numeric,2 numeric,2
## p.value 1.453228e-45 0.2175176
## alternative "kurtosis is not equal to 3" "kurtosis is not equal to 3"
## method "Anscombe-Glynn kurtosis test" "Anscombe-Glynn kurtosis test"
## data.name "X[[i]]" "X[[i]]"
#Kurtosis test: reject normality for all the variables except "Room.Board", "Terminal", "perc.alumni", and "Grad.Rate".
#Conclusion: combining both tests, reject the normality assumption for all the variables except "Grad.Rate".
library(corrplot)
cor_table<- round(cor(data[,c(3,4:19)]),3)
cor_table
## Apps Accept Enroll Top10perc Top25perc F.Undergrad P.Undergrad
## Apps 1.000 0.943 0.847 0.339 0.352 0.814 0.398
## Accept 0.943 1.000 0.912 0.192 0.247 0.874 0.441
## Enroll 0.847 0.912 1.000 0.181 0.227 0.965 0.513
## Top10perc 0.339 0.192 0.181 1.000 0.892 0.141 -0.105
## Top25perc 0.352 0.247 0.227 0.892 1.000 0.199 -0.054
## F.Undergrad 0.814 0.874 0.965 0.141 0.199 1.000 0.571
## P.Undergrad 0.398 0.441 0.513 -0.105 -0.054 0.571 1.000
## Outstate 0.050 -0.026 -0.155 0.562 0.489 -0.216 -0.254
## Room.Board 0.165 0.091 -0.040 0.371 0.331 -0.069 -0.061
## Books 0.133 0.114 0.113 0.119 0.116 0.116 0.081
## Personal 0.179 0.201 0.281 -0.093 -0.081 0.317 0.320
## PhD 0.391 0.356 0.331 0.532 0.546 0.318 0.149
## Terminal 0.369 0.338 0.308 0.491 0.525 0.300 0.142
## S.F.Ratio 0.096 0.176 0.237 -0.385 -0.295 0.280 0.233
## perc.alumni -0.090 -0.160 -0.181 0.455 0.418 -0.229 -0.281
## Expend 0.260 0.125 0.064 0.661 0.527 0.019 -0.084
## Grad.Rate 0.147 0.067 -0.022 0.495 0.477 -0.079 -0.257
## Outstate Room.Board Books Personal PhD Terminal S.F.Ratio
## Apps 0.050 0.165 0.133 0.179 0.391 0.369 0.096
## Accept -0.026 0.091 0.114 0.201 0.356 0.338 0.176
## Enroll -0.155 -0.040 0.113 0.281 0.331 0.308 0.237
## Top10perc 0.562 0.371 0.119 -0.093 0.532 0.491 -0.385
## Top25perc 0.489 0.331 0.116 -0.081 0.546 0.525 -0.295
## F.Undergrad -0.216 -0.069 0.116 0.317 0.318 0.300 0.280
## P.Undergrad -0.254 -0.061 0.081 0.320 0.149 0.142 0.233
## Outstate 1.000 0.654 0.039 -0.299 0.383 0.408 -0.555
## Room.Board 0.654 1.000 0.128 -0.199 0.329 0.375 -0.363
## Books 0.039 0.128 1.000 0.179 0.027 0.100 -0.032
## Personal -0.299 -0.199 0.179 1.000 -0.011 -0.031 0.136
## PhD 0.383 0.329 0.027 -0.011 1.000 0.850 -0.131
## Terminal 0.408 0.375 0.100 -0.031 0.850 1.000 -0.160
## S.F.Ratio -0.555 -0.363 -0.032 0.136 -0.131 -0.160 1.000
## perc.alumni 0.566 0.272 -0.040 -0.286 0.249 0.267 -0.403
## Expend 0.673 0.502 0.112 -0.098 0.433 0.439 -0.584
## Grad.Rate 0.571 0.425 0.001 -0.269 0.305 0.290 -0.307
## perc.alumni Expend Grad.Rate
## Apps -0.090 0.260 0.147
## Accept -0.160 0.125 0.067
## Enroll -0.181 0.064 -0.022
## Top10perc 0.455 0.661 0.495
## Top25perc 0.418 0.527 0.477
## F.Undergrad -0.229 0.019 -0.079
## P.Undergrad -0.281 -0.084 -0.257
## Outstate 0.566 0.673 0.571
## Room.Board 0.272 0.502 0.425
## Books -0.040 0.112 0.001
## Personal -0.286 -0.098 -0.269
## PhD 0.249 0.433 0.305
## Terminal 0.267 0.439 0.290
## S.F.Ratio -0.403 -0.584 -0.307
## perc.alumni 1.000 0.418 0.491
## Expend 0.418 1.000 0.390
## Grad.Rate 0.491 0.390 1.000
corrplot(cor_table)
As can be seen, the variables Accept, Enroll, and F.Undergrad are highly correlated with one another (and with Apps).
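Highly correlated pairs can also be extracted programmatically rather than read off the plot; here is a sketch on simulated collinear data (the same `which(..., arr.ind = TRUE)` pattern would apply to `cor_table`):

```r
# Find variable pairs whose absolute correlation exceeds a threshold.
set.seed(3)
n <- 200
a <- rnorm(n)
d <- data.frame(a = a,
                b = a + rnorm(n, sd = 0.1),  # nearly collinear with a
                c = rnorm(n))                # unrelated noise
cm <- cor(d)

# upper.tri() keeps each pair once and skips the diagonal.
high <- which(abs(cm) > 0.9 & upper.tri(cm), arr.ind = TRUE)
data.frame(var1 = rownames(cm)[high[, 1]],
           var2 = colnames(cm)[high[, 2]],
           r    = round(cm[high], 3))
```

On the college data this would list pairs such as Accept/Enroll and Enroll/F.Undergrad, which matters later because such overlap inflates coefficient standard errors.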
par(mar=c(2,2,2,2))
par(mfrow=c(4,4))#4 rows and 4 columns
for(i in c(4:19)){
plot(data[,i], data$Apps, main = paste("Scatter Plot of Apps vs.", names(data)[i]))
}
par(mar=c(2,2,2,2))
par(mfrow=c(4,4))#4 rows and 4 columns
for(i in c(4:19)){
boxplot(data[,i] ~ Private, data = data, main = paste("Boxplot of", names(data)[i], "by type of university"))
}
par(mfrow=c(1,1))
Univariate Detection: 1. Apps
data_1<-data[,-1]
tukey_1<-quantile(data_1$Apps,probs=0.75)+1.5*IQR(data_1$Apps)
tukey_2<-quantile(data_1$Apps,probs=0.25)-1.5*IQR(data_1$Apps)
out_Apps_sum<-sum(data_1$Apps>tukey_1|data_1$Apps < tukey_2)
out_Apps_sum
## [1] 70
out_Apps_procent<-out_Apps_sum/nrow(data_1)*100
out_Apps_procent # about 9% of the Apps values are outliers
## [1] 9.009009
which((data_1$Apps > tukey_1|data_1$Apps < tukey_2))
## [1] 24 60 62 71 88 119 142 159 175 177 192 204 222 251 270 275 278 280 285
## [20] 366 367 408 413 419 421 425 433 441 446 460 462 484 511 561 562 563 564 566
## [39] 570 571 577 582 606 607 612 615 620 621 624 625 627 634 635 638 641 650 652
## [58] 663 664 665 668 670 678 686 694 695 701 714 744 776
out_Apps<-data_1$Apps[data_1$Apps > tukey_1|data_1$Apps < tukey_2]
out_Apps
## [1] 12809 20192 9251 12586 8728 8065 9478 8587 13789 9274 8506 11651
## [13] 11115 13865 8681 16587 8427 11223 8474 9239 18114 13594 10634 11901
## [25] 10706 12289 11023 8256 19315 13218 21804 48094 9402 13528 14463 15039
## [37] 12512 8000 8598 8399 10477 14474 19873 15698 9735 14446 12445 11220
## [49] 14939 8384 8579 14292 14438 19152 11054 9750 14596 8631 12394 8586
## [61] 9643 8766 12229 14752 15849 12749 14901 15712 9167 10705
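The fence computation above can be packaged as a reusable helper so the same rule is applied identically to every variable (a sketch; `tukey_outliers` is a hypothetical name, not part of the original script):

```r
# Tukey's rule: flag values beyond k * IQR outside the quartiles
# (k = 1.5 is the conventional fence). Returns the flagged indices.
tukey_outliers <- function(x, k = 1.5) {
  q <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE)
  fence_lo <- q[1] - k * IQR(x, na.rm = TRUE)
  fence_hi <- q[2] + k * IQR(x, na.rm = TRUE)
  which(x < fence_lo | x > fence_hi)
}

# Example on a small vector with one obvious outlier:
tukey_outliers(c(1, 2, 3, 4, 5, 100))   # flags the value 100
```

Calling `tukey_outliers(data_1$Apps)` would reproduce the 70 indices listed above.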
2. Other variables
**Show which observations are outliers in each variable:**
for(i in c(3:18) ){
print(names(data_1)[i])
tukey_1i<-quantile(data_1[,i],probs=0.75)+1.5*IQR(data_1[,i])
tukey_2i<-quantile(data_1[,i],probs=0.25)-1.5*IQR(data_1[,i])
out_Apps_sumi<-sum(data_1[,i]>tukey_1i|data_1[,i] < tukey_2i)
print(out_Apps_sumi)
out_Apps_procenti<-out_Apps_sumi/nrow(data_1)*100
print(out_Apps_procenti)
print(which(data_1[,i]>tukey_1i|data_1[,i] < tukey_2i))
}
## [1] "Accept"
## [1] 73
## [1] 9.395109
## [1] 24 28 40 60 62 70 88 119 142 177 204 258 270 275 278 279 280 366 367
## [20] 408 413 419 421 425 433 446 462 484 511 537 561 562 563 564 577 582 606 607
## [39] 611 612 614 615 620 621 624 625 627 634 635 637 638 641 648 650 652 663 664
## [58] 665 668 670 676 678 684 686 693 694 695 701 711 714 728 729 744
## [1] "Enroll"
## [1] 79
## [1] 10.16731
## [1] 22 24 28 40 60 62 70 119 142 177 204 223 270 274 275 278 280 289 325
## [20] 346 366 367 408 413 419 420 421 425 433 437 446 462 484 490 511 537 563 577
## [39] 582 586 605 606 607 608 611 612 615 620 621 624 625 627 634 635 638 641 643
## [58] 648 650 652 658 661 662 663 664 665 668 676 678 684 686 692 694 695 701 702
## [77] 714 728 744
## [1] "Top10perc"
## [1] 39
## [1] 5.019305
## [1] 17 55 61 71 72 87 92 115 138 139 145 159 160 175 192 222 223 251 252
## [20] 285 355 408 425 447 460 606 607 610 638 652 661 664 694 709 721 726 734 764
## [39] 776
## [1] "Top25perc"
## [1] 0
## [1] 0
## integer(0)
## [1] "F.Undergrad"
## [1] 97
## [1] 12.48391
## [1] 22 24 28 40 60 62 70 79 80 119 142 177 182 202 204 219 223 270 274
## [20] 275 278 280 281 289 325 341 366 367 376 383 408 413 419 420 421 433 437 446
## [39] 462 484 490 511 537 561 562 563 564 577 582 605 606 607 608 611 612 615 620
## [58] 621 623 624 625 627 629 634 635 638 641 643 648 650 652 653 658 660 662 663
## [77] 664 665 668 676 677 678 681 684 685 686 687 692 694 695 701 702 712 714 728
## [96] 744 747
## [1] "P.Undergrad"
## [1] 67
## [1] 8.622909
## [1] 24 39 60 103 143 171 178 202 204 219 224 233 234 275 304 325 330 346 367
## [20] 384 408 413 419 420 428 441 462 483 484 511 534 537 563 582 586 604 608 615
## [39] 620 621 623 625 629 633 634 641 645 648 653 657 658 662 665 668 676 677 680
## [58] 684 685 686 687 692 695 696 702 712 744
## [1] "Outstate"
## [1] 1
## [1] 0.1287001
## [1] 48
## [1] "Room.Board"
## [1] 7
## [1] 0.9009009
## [1] 38 347 408 415 419 516 664
## [1] "Books"
## [1] 46
## [1] 5.920206
## [1] 5 19 22 35 61 64 70 73 101 119 134 182 202 258 302 318 320 321 368
## [20] 369 374 378 388 390 397 422 437 455 465 472 481 483 498 514 531 534 574 579
## [39] 649 688 691 692 728 736 738 742
## [1] "Personal"
## [1] 20
## [1] 2.574003
## [1] 190 202 224 296 299 318 369 431 437 474 498 581 625 645 662 684 686 692 712
## [20] 763
## [1] "PhD"
## [1] 8
## [1] 1.029601
## [1] 86 96 101 227 514 609 688 736
## [1] "Terminal"
## [1] 8
## [1] 1.029601
## [1] 2 227 265 342 369 395 507 740
## [1] "S.F.Ratio"
## [1] 12
## [1] 1.544402
## [1] 92 227 276 285 312 318 364 495 609 687 729 744
## [1] "perc.alumni"
## [1] 5
## [1] 0.6435006
## [1] 17 87 107 243 764
## [1] "Expend"
## [1] 48
## [1] 6.177606
## [1] 4 17 21 61 65 71 72 87 88 92 115 145 153 159 160 175 192 222 238
## [20] 243 251 252 285 355 391 408 425 460 498 516 529 576 594 610 637 664 670 678
## [39] 690 709 710 721 729 734 736 754 764 776
## [1] "Grad.Rate"
## [1] 4
## [1] 0.5148005
## [1] 5 96 385 586
max(out_Apps)
## [1] 48094
data$College.Name[484]
## [1] "Rutgers at New Brunswick"
# We expect this observation to be a high-leverage point.
Multivariate Detection
data_2<-data[,3:19]
mah_d2<-mahalanobis(data_2,colMeans(data[,3:19]),cov(data_2))
out_All<-mah_d2[mah_d2>10]
length(out_All)/length(mah_d2)*100
## [1] 56.49936
About 56% of the observations have a Mahalanobis distance greater than 10.
out_All<-mah_d2[mah_d2>20]
length(out_All)/length(mah_d2)*100
## [1] 19.94852
About 19.9% of the observations have a Mahalanobis distance greater than 20.
out_All<-mah_d2[mah_d2>30]
length(out_All)/length(mah_d2)*100
## [1] 11.19691
About 11.2% of the observations have a Mahalanobis distance greater than 30.
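The thresholds 10, 20, and 30 are ad hoc. Under multivariate normality, squared Mahalanobis distances approximately follow a chi-squared distribution with p degrees of freedom, so a quantile of that distribution gives a principled cutoff; here is a sketch on simulated data (the real data are far from multivariate normal, which is one reason so many points exceed fixed thresholds):

```r
# Under multivariate normality, mahalanobis() distances are approximately
# chi-squared distributed with p degrees of freedom.
set.seed(42)
p <- 17                                  # number of numeric variables
X <- matrix(rnorm(2000 * p), ncol = p)   # simulated "clean" data
d2 <- mahalanobis(X, colMeans(X), cov(X))

cutoff <- qchisq(0.975, df = p)          # 97.5% chi-squared quantile
mean(d2 > cutoff)                        # close to 0.025 by construction
```

With p = 17, `qchisq(0.975, 17)` is about 30, so flagging distances above 30 in the college data corresponds roughly to a 2.5% false-positive rate under normality; the observed 11% flagged points signal genuine heavy tails.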
Observation #484, besides being an outlier in Apps, is also an outlier in the variables P.Undergrad, F.Undergrad, Accept, and Enroll. It was probably entered incorrectly, so to improve the results we removed it from the data.
data<-data[-which(row.names(data)==484),]
#Apps (dependent variable without observation #484)
summary(data$Apps)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 81 776 1558 2944 3603 21804
#Distribution of Apps without observation #484
#1.histogram
hist(data$Apps, xlab = "Number of applications received", main = "Histogram of Apps")
#2.boxplot
boxplot(data$Apps, main = "Boxplot of Apps")
#3.QQ-plot
qqnorm(data$Apps, main="qqplot of applications received",pch=20)
qqline(data$Apps,col="red")
#4.Jarque-Bera test (skewness = 0) and Anscombe-Glynn test (kurtosis = 3)
library(moments)
jarque.test(data$Apps)
##
## Jarque-Bera Normality Test
##
## data: data$Apps
## JB = 1722.4, p-value < 2.2e-16
## alternative hypothesis: greater
anscombe.test(data$Apps)
##
## Anscombe-Glynn kurtosis test
##
## data: data$Apps
## kurt = 8.6878, z = 10.1187, p-value < 2.2e-16
## alternative hypothesis: kurtosis is not equal to 3
#Conclusion: despite the improved statistics, we still reject the normality assumption for the Apps distribution.
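Since normality is still rejected, a transformation of the heavily right-skewed response is a natural next step; the MASS package was loaded earlier precisely for Box-Cox. Below is a sketch on simulated right-skewed data (not a claim about what the final model uses):

```r
library(MASS)  # for boxcox()

set.seed(7)
y <- rlnorm(777, meanlog = 7.3, sdlog = 1)   # right-skewed, roughly like Apps

# Profile the Box-Cox likelihood over a grid of lambda values; the
# maximizing lambda suggests a power transform, with lambda near 0
# corresponding to log(y).
bc <- boxcox(y ~ 1, plotit = FALSE)
lambda_hat <- bc$x[which.max(bc$y)]
lambda_hat   # near 0 here, pointing to a log transform
```

Running `boxcox(Apps ~ 1)` on the real data would indicate whether a log (or other power) transform of Apps is worth trying before the linear models below.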
Remove College.Name
data_1<-data[,-1]
Divide Dataset into Train and Test
set.seed(123456)
train_cases <- sample(1:nrow(data_1), nrow(data_1) * 0.7)
train <- data_1[train_cases,]
test <- data_1[- train_cases,]
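A quick sanity check that this 70/30 split is reproducible and non-overlapping; a sketch with a toy data frame (the same pattern applies to `data_1`):

```r
# Reproducible split: the seed fixes the sampled indices, and the
# negative-index complement guarantees train and test are disjoint.
set.seed(123456)
df <- data.frame(id = 1:100, y = rnorm(100))
train_idx <- sample(1:nrow(df), nrow(df) * 0.7)
train_df <- df[train_idx, ]
test_df  <- df[-train_idx, ]

nrow(train_df) + nrow(test_df) == nrow(df)   # every row used exactly once
length(intersect(train_df$id, test_df$id))   # 0: no leakage into the test set
```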
summary(train)
## Private Apps Accept Enroll
## Private :390 Min. : 81.0 Min. : 72.0 Min. : 35.0
## Not Private:153 1st Qu.: 738.5 1st Qu.: 578.5 1st Qu.: 227.5
## Median : 1603.0 Median : 1101.0 Median : 428.0
## Mean : 2983.8 Mean : 1998.0 Mean : 768.2
## 3rd Qu.: 3780.5 3rd Qu.: 2525.5 3rd Qu.: 902.5
## Max. :21804.0 Max. :18744.0 Max. :6180.0
## Top10perc Top25perc F.Undergrad P.Undergrad
## Min. : 1.00 Min. : 13.00 Min. : 139 Min. : 1.0
## 1st Qu.:15.00 1st Qu.: 40.00 1st Qu.: 962 1st Qu.: 107.5
## Median :23.00 Median : 53.00 Median : 1688 Median : 370.0
## Mean :26.87 Mean : 54.82 Mean : 3662 Mean : 886.4
## 3rd Qu.:34.00 3rd Qu.: 67.00 3rd Qu.: 4056 3rd Qu.: 987.5
## Max. :95.00 Max. :100.00 Max. :28938 Max. :21836.0
## Outstate Room.Board Books Personal
## Min. : 2340 Min. :1780 Min. : 110.0 Min. : 300
## 1st Qu.: 7366 1st Qu.:3613 1st Qu.: 475.5 1st Qu.: 900
## Median : 9990 Median :4218 Median : 500.0 Median :1220
## Mean :10399 Mean :4377 Mean : 548.6 Mean :1357
## 3rd Qu.:12838 3rd Qu.:5075 3rd Qu.: 600.0 3rd Qu.:1658
## Max. :19960 Max. :7400 Max. :2340.0 Max. :6800
## PhD Terminal S.F.Ratio perc.alumni
## Min. : 8.00 Min. : 24.0 Min. : 2.90 Min. : 0.00
## 1st Qu.: 62.00 1st Qu.: 70.5 1st Qu.:11.40 1st Qu.:13.00
## Median : 75.00 Median : 81.0 Median :13.50 Median :21.00
## Mean : 72.36 Mean : 79.3 Mean :14.06 Mean :22.61
## 3rd Qu.: 85.00 3rd Qu.: 92.0 3rd Qu.:16.50 3rd Qu.:31.00
## Max. :103.00 Max. :100.0 Max. :39.80 Max. :63.00
## Expend Grad.Rate
## Min. : 3186 Min. : 10
## 1st Qu.: 6682 1st Qu.: 53
## Median : 8408 Median : 65
## Mean : 9708 Mean : 65
## 3rd Qu.:10880 3rd Qu.: 78
## Max. :56233 Max. :118
summary(test)
## Private Apps Accept Enroll
## Private :175 Min. : 174 Min. : 146 Min. : 75.0
## Not Private: 58 1st Qu.: 805 1st Qu.: 638 1st Qu.: 274.0
## Median : 1538 Median : 1141 Median : 452.0
## Mean : 2850 Mean : 1963 Mean : 791.3
## 3rd Qu.: 3540 3rd Qu.: 2140 3rd Qu.: 871.0
## Max. :16587 Max. :13243 Max. :6392.0
## Top10perc Top25perc F.Undergrad P.Undergrad
## Min. : 1.00 Min. : 9.00 Min. : 316 Min. : 1.0
## 1st Qu.:18.00 1st Qu.:45.00 1st Qu.: 1058 1st Qu.: 75.0
## Median :25.00 Median :57.00 Median : 1737 Median : 299.0
## Mean :29.14 Mean :57.98 Mean : 3713 Mean : 770.5
## 3rd Qu.:37.00 3rd Qu.:71.00 3rd Qu.: 3796 3rd Qu.: 911.0
## Max. :96.00 Max. :99.00 Max. :31643 Max. :10221.0
## Outstate Room.Board Books Personal
## Min. : 2700 Min. :2217 Min. : 96.0 Min. : 250
## 1st Qu.: 7260 1st Qu.:3518 1st Qu.: 450.0 1st Qu.: 800
## Median : 9900 Median :4124 Median : 500.0 Median :1200
## Mean :10552 Mean :4311 Mean : 550.6 Mean :1300
## 3rd Qu.:13240 3rd Qu.:5015 3rd Qu.: 600.0 3rd Qu.:1700
## Max. :21700 Max. :8124 Max. :2000.0 Max. :4913
## PhD Terminal S.F.Ratio perc.alumni
## Min. : 10.00 Min. : 33.00 Min. : 2.50 Min. : 2.00
## 1st Qu.: 63.00 1st Qu.: 71.00 1st Qu.:11.70 1st Qu.:13.00
## Median : 76.00 Median : 84.00 Median :13.70 Median :21.00
## Mean : 73.27 Mean : 80.58 Mean :14.14 Mean :23.06
## 3rd Qu.: 86.00 3rd Qu.: 93.00 3rd Qu.:16.40 3rd Qu.:31.00
## Max. :100.00 Max. :100.00 Max. :28.80 Max. :64.00
## Expend Grad.Rate
## Min. : 3365 Min. : 15.0
## 1st Qu.: 6898 1st Qu.: 56.0
## Median : 8243 Median : 66.0
## Mean : 9544 Mean : 66.5
## 3rd Qu.:10458 3rd Qu.: 77.0
## Max. :33541 Max. :100.0
dim(train)
## [1] 543 18
dim(test)
## [1] 233 18
Traditional_lm_1 <- lm(Apps ~ ., data= train)
summary(Traditional_lm_1)
##
## Call:
## lm(formula = Apps ~ ., data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3379.6 -407.7 -55.2 345.2 6779.3
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.409e+03 4.377e+02 -3.218 0.001369 **
## PrivateNot Private 5.240e+02 1.629e+02 3.216 0.001379 **
## Accept 1.273e+00 5.973e-02 21.305 < 2e-16 ***
## Enroll -2.995e-01 2.303e-01 -1.300 0.194053
## Top10perc 5.114e+01 6.330e+00 8.079 4.53e-15 ***
## Top25perc -1.267e+01 5.050e+00 -2.510 0.012385 *
## F.Undergrad 8.808e-02 3.545e-02 2.484 0.013289 *
## P.Undergrad 3.464e-02 3.508e-02 0.988 0.323845
## Outstate -5.346e-02 2.238e-02 -2.389 0.017236 *
## Room.Board 1.922e-01 5.689e-02 3.379 0.000783 ***
## Books 2.132e-01 2.889e-01 0.738 0.460885
## Personal -2.472e-02 7.100e-02 -0.348 0.727882
## PhD -1.026e+01 5.321e+00 -1.929 0.054264 .
## Terminal 6.484e-01 5.697e+00 0.114 0.909430
## S.F.Ratio 1.443e+01 1.461e+01 0.988 0.323847
## perc.alumni -5.536e+00 4.723e+00 -1.172 0.241590
## Expend 5.757e-02 1.307e-02 4.405 1.28e-05 ***
## Grad.Rate 9.018e+00 3.342e+00 2.698 0.007193 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1001 on 525 degrees of freedom
## Multiple R-squared: 0.9256, Adjusted R-squared: 0.9232
## F-statistic: 384.2 on 17 and 525 DF, p-value: < 2.2e-16
#R-squared and adjusted R-squared are close to 1: about 92.32% of the variation in Apps is explained by the predictors.
#Reject H0 for the F-test: there is at least one linear relationship.
#t-test results: H0 is not rejected for Enroll, P.Undergrad, Books, Personal, Terminal, S.F.Ratio, and perc.alumni, so these variables are dropped from the model.
#The t-test for PhD is only borderline significant, so we keep it in the model.
#Delete insignificant variables
Traditional_lm_2 <- lm(Apps ~ .-Enroll-P.Undergrad-Books-Personal-Terminal-S.F.Ratio-perc.alumni,data= train)
summary(Traditional_lm_2)
##
## Call:
## lm(formula = Apps ~ . - Enroll - P.Undergrad - Books - Personal -
## Terminal - S.F.Ratio - perc.alumni, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3395.7 -418.6 -27.8 338.3 6653.7
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.181e+03 2.995e+02 -3.942 9.17e-05 ***
## PrivateNot Private 5.833e+02 1.586e+02 3.679 0.000258 ***
## Accept 1.226e+00 4.440e-02 27.610 < 2e-16 ***
## Top10perc 4.893e+01 6.139e+00 7.971 9.63e-15 ***
## Top25perc -1.221e+01 4.944e+00 -2.470 0.013842 *
## F.Undergrad 6.411e-02 2.347e-02 2.731 0.006520 **
## Outstate -5.938e-02 2.171e-02 -2.736 0.006431 **
## Room.Board 2.303e-01 5.388e-02 4.274 2.28e-05 ***
## PhD -9.936e+00 3.638e+00 -2.731 0.006524 **
## Expend 5.218e-02 1.205e-02 4.330 1.79e-05 ***
## Grad.Rate 7.587e+00 3.208e+00 2.365 0.018384 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1001 on 532 degrees of freedom
## Multiple R-squared: 0.9247, Adjusted R-squared: 0.9233
## F-statistic: 653.1 on 10 and 532 DF, p-value: < 2.2e-16
#R-squared and adjusted R-squared are close to 1: about 92.33% of the variation in Apps is explained by the predictors.
#Reject H0 for the F-test: there is at least one linear relationship.
#Reject H0 for the t-tests of all remaining variables.
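Dropping seven predictors at once can also be justified formally with a partial F-test comparing the nested fits; here is a sketch on simulated data (the analogous call for the real fits would be `anova(Traditional_lm_2, Traditional_lm_1)`, which was not run in the original analysis):

```r
# Partial F-test: does the full model fit significantly better than the
# reduced one? Sketch with one genuinely irrelevant predictor.
set.seed(11)
n  <- 100
x1 <- rnorm(n); x2 <- rnorm(n); noise_x <- rnorm(n)
y  <- 2 * x1 + 3 * x2 + rnorm(n)

reduced <- lm(y ~ x1 + x2)
full    <- lm(y ~ x1 + x2 + noise_x)
anova(reduced, full)   # the F-test p-value judges whether noise_x earns its keep
```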
#Check Assumptions of Regression for Model Traditional_lm_2
#Normality of residuals
#1.Histogram
hist(Traditional_lm_2$residuals, probability = TRUE, main = "Histogram of residuals in Model 'Traditional_lm_2'", breaks = 25)
lines(density(Traditional_lm_2$residuals), col = "red")
#2.QQ-plot
qqnorm(Traditional_lm_2$residuals, main = "QQ Plot of residuals of Model 'Traditional_lm_2'", pch = 20)
qqline(Traditional_lm_2$residuals, col = "red")
#There are serious deviations from the 45-degree line; the residual distribution is probably not normal.
#3.Test for Skewness and Kurtosis
jarque.test(Traditional_lm_2$residuals)
##
## Jarque-Bera Normality Test
##
## data: Traditional_lm_2$residuals
## JB = 2965.5, p-value < 2.2e-16
## alternative hypothesis: greater
anscombe.test(Traditional_lm_2$residuals)
##
## Anscombe-Glynn kurtosis test
##
## data: Traditional_lm_2$residuals
## kurt = 13.759, z = 10.557, p-value < 2.2e-16
## alternative hypothesis: kurtosis is not equal to 3
#Note: residuals are not normally distributed, so the normality assumption of the regression is violated!
#Cook's distance > 1
which(cooks.distance(Traditional_lm_2)>1)
## named integer(0)
#Diagnostic Plots
plot(Traditional_lm_2)
#Residuals vs Fitted shows a clear pattern, suggesting non-linearity or heteroscedasticity rather than pure noise.
#The Q-Q plot shows serious deviations from the 45-degree line; observations 251, 694, and 71 probably drive the non-normality.
#The Scale-Location plot (sqrt of |standardized residuals| vs fitted values) also shows a pattern, pointing to heteroscedasticity.
#Residuals vs Leverage flags influential observations via Cook's distance, which measures each observation's effect on the fitted model; values above 1 usually cause trouble and should be removed.
#Check multicollinearity
car::vif(Traditional_lm_2)
## Private Accept Top10perc Top25perc F.Undergrad Outstate
## 2.759054 5.697257 6.349038 5.181526 6.485246 4.047322
## Room.Board PhD Expend Grad.Rate
## 1.866515 1.891698 2.342980 1.684368
#All VIFs are below 10, so there is no serious multicollinearity problem.
#Conclusion: severe violation of the regression assumptions.
#Bad model!
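The VIF values above can be reproduced from their definition, VIF_j = 1 / (1 - R_j^2), where R_j^2 is the R-squared from regressing predictor j on the remaining predictors; a sketch on simulated predictors:

```r
# VIF from first principles: regress one predictor on the others and
# convert its R-squared into a variance inflation factor.
set.seed(9)
n  <- 300
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.5)   # correlated with x1
x3 <- rnorm(n)                  # independent

r2     <- summary(lm(x1 ~ x2 + x3))$r.squared
vif_x1 <- 1 / (1 - r2)
vif_x1   # well above 1 because x1 and x2 overlap; a common rule flags VIF > 10
```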
Test the Model Traditional_lm_2
#Prediction
pred_Tlm <- predict(Traditional_lm_2, test)
head(pred_Tlm ,5)
## 8 13 15 18 20
## 2566.7247 1831.8215 677.0293 1088.8244 3032.5750
#Absolute error mean, median, sd, max, min-------
abs_err_Tlm <- abs(pred_Tlm - test$Apps)
mean(abs_err_Tlm)
## [1] 599.7741
median(abs_err_Tlm)
## [1] 375.7187
sd(abs_err_Tlm)
## [1] 734.9941
range(abs_err_Tlm)
## [1] 2.019535 6893.422430
#histogram and boxplot
hist(abs_err_Tlm, breaks = 15)
boxplot(abs_err_Tlm)
#Actual vs. Predicted
plot(test$Apps, pred_Tlm , xlab = "Actual", ylab = "Prediction")
abline(a = 0, b = 1, col = "red", lwd = 2)
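The same absolute-error summaries will be needed for every competing model; a small helper keeps them comparable (`err_summary` is a hypothetical name, not part of the original script):

```r
# Summarize prediction errors for a model on a test set.
err_summary <- function(actual, predicted) {
  abs_err <- abs(predicted - actual)
  c(MAE    = mean(abs_err),
    Median = median(abs_err),
    SD     = sd(abs_err),
    Min    = min(abs_err),
    Max    = max(abs_err),
    RMSE   = sqrt(mean((predicted - actual)^2)))
}

# Toy example with absolute errors 10, 10, 30:
err_summary(actual = c(100, 200, 300), predicted = c(110, 190, 330))
```

Calling `err_summary(test$Apps, pred_Tlm)` would reproduce the statistics computed above for `Traditional_lm_2`.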
Model 2: Step-by-step outlier removal. We remove suspect observations iteratively and refit the model.
#1-1-------------
#Remove cases #251, #694, and #71.
train2<- train[-which(rownames(train) %in% c(251,694,71)),]
dim(train2)
## [1] 540 18
dim(train)
## [1] 543 18
Traditional_lm_3 <- lm(Apps ~ ., data= train2)
summary(Traditional_lm_3)
##
## Call:
## lm(formula = Apps ~ ., data = train2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3600.0 -351.1 -22.6 267.8 5905.2
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.222e+03 3.842e+02 -3.180 0.001562 **
## PrivateNot Private 4.083e+02 1.428e+02 2.859 0.004426 **
## Accept 1.381e+00 5.288e-02 26.107 < 2e-16 ***
## Enroll -6.699e-01 2.033e-01 -3.295 0.001052 **
## Top10perc 4.128e+01 5.598e+00 7.375 6.51e-13 ***
## Top25perc -7.023e+00 4.439e+00 -1.582 0.114228
## F.Undergrad 1.030e-01 3.100e-02 3.324 0.000950 ***
## P.Undergrad 5.393e-02 3.069e-02 1.758 0.079410 .
## Outstate -6.139e-02 1.959e-02 -3.133 0.001825 **
## Room.Board 1.848e-01 4.978e-02 3.713 0.000227 ***
## Books 3.067e-01 2.529e-01 1.213 0.225865
## Personal -3.492e-02 6.216e-02 -0.562 0.574568
## PhD -8.996e+00 4.649e+00 -1.935 0.053505 .
## Terminal 1.374e+00 4.977e+00 0.276 0.782598
## S.F.Ratio 1.078e+01 1.287e+01 0.838 0.402521
## perc.alumni -4.002e+00 4.138e+00 -0.967 0.333891
## Expend 4.655e-02 1.165e-02 3.995 7.40e-05 ***
## Grad.Rate 6.109e+00 2.929e+00 2.086 0.037506 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 874.3 on 522 degrees of freedom
## Multiple R-squared: 0.9404, Adjusted R-squared: 0.9384
## F-statistic: 484.4 on 17 and 522 DF, p-value: < 2.2e-16
Traditional_lm_4 <- lm(Apps ~ .-Top25perc-Books-Personal-Terminal-S.F.Ratio-perc.alumni,data= train2)
summary(Traditional_lm_4)
##
## Call:
## lm(formula = Apps ~ . - Top25perc - Books - Personal - Terminal -
## S.F.Ratio - perc.alumni, data = train2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3610.0 -371.7 -25.3 270.8 6001.4
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.071e+03 2.522e+02 -4.247 2.56e-05 ***
## PrivateNot Private 4.346e+02 1.391e+02 3.123 0.001888 **
## Accept 1.388e+00 5.154e-02 26.940 < 2e-16 ***
## Enroll -6.746e-01 1.999e-01 -3.374 0.000795 ***
## Top10perc 3.427e+01 3.357e+00 10.209 < 2e-16 ***
## F.Undergrad 1.014e-01 3.064e-02 3.309 0.001000 ***
## P.Undergrad 5.466e-02 3.044e-02 1.795 0.073174 .
## Outstate -6.948e-02 1.899e-02 -3.658 0.000280 ***
## Room.Board 2.060e-01 4.799e-02 4.292 2.11e-05 ***
## PhD -9.016e+00 3.182e+00 -2.833 0.004787 **
## Expend 4.549e-02 1.055e-02 4.311 1.94e-05 ***
## Grad.Rate 5.069e+00 2.838e+00 1.786 0.074651 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 874.4 on 528 degrees of freedom
## Multiple R-squared: 0.9397, Adjusted R-squared: 0.9384
## F-statistic: 747.8 on 11 and 528 DF, p-value: < 2.2e-16
#Check Assumptions of Regression for Model Traditional_lm_4
#Diagnostic Plots
plot(Traditional_lm_4)
#Test for Skewness and Kurtosis
jarque.test(Traditional_lm_4$residuals)
##
## Jarque-Bera Normality Test
##
## data: Traditional_lm_4$residuals
## JB = 3028, p-value < 2.2e-16
## alternative hypothesis: greater
anscombe.test(Traditional_lm_4$residuals)
##
## Anscombe-Glynn kurtosis test
##
## data: Traditional_lm_4$residuals
## kurt = 13.964, z = 10.586, p-value < 2.2e-16
## alternative hypothesis: kurtosis is not equal to 3
#Note: Residuals are not Normally Distributed, so regression assumption is not valid!
#Outliers 175 and 606 probably cause the non-normal distribution
#1-2---------------------
#remove cases 175, 606
train2<- train[-which(rownames(train) %in% c(251,694,71,175,606)),]
dim(train2)
## [1] 538 18
Traditional_lm_3 <- lm(Apps ~ ., data= train2)
summary(Traditional_lm_3)
##
## Call:
## lm(formula = Apps ~ ., data = train2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3513.6 -334.2 -33.8 239.9 4840.4
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.325e+03 3.516e+02 -3.770 0.000182 ***
## PrivateNot Private 3.769e+02 1.307e+02 2.884 0.004085 **
## Accept 1.387e+00 4.838e-02 28.674 < 2e-16 ***
## Enroll -6.184e-01 1.867e-01 -3.313 0.000988 ***
## Top10perc 2.829e+01 5.300e+00 5.337 1.41e-07 ***
## Top25perc -1.452e-01 4.121e+00 -0.035 0.971910
## F.Undergrad 8.163e-02 2.859e-02 2.856 0.004467 **
## P.Undergrad 6.308e-02 2.808e-02 2.246 0.025124 *
## Outstate -5.991e-02 1.792e-02 -3.344 0.000887 ***
## Room.Board 1.564e-01 4.568e-02 3.424 0.000666 ***
## Books 3.399e-01 2.314e-01 1.469 0.142378
## Personal -4.073e-02 5.686e-02 -0.716 0.474047
## PhD -7.066e+00 4.258e+00 -1.659 0.097625 .
## Terminal 1.197e+00 4.552e+00 0.263 0.792736
## S.F.Ratio 1.203e+01 1.177e+01 1.022 0.307387
## perc.alumni -3.079e+00 3.788e+00 -0.813 0.416588
## Expend 4.985e-02 1.070e-02 4.658 4.06e-06 ***
## Grad.Rate 5.799e+00 2.679e+00 2.164 0.030900 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 799.6 on 520 degrees of freedom
## Multiple R-squared: 0.9471, Adjusted R-squared: 0.9454
## F-statistic: 547.8 on 17 and 520 DF, p-value: < 2.2e-16
Traditional_lm_4 <- lm(Apps ~ .-Top25perc-Books-Personal-Terminal-S.F.Ratio-perc.alumni,data= train2)
summary(Traditional_lm_4)
##
## Call:
## lm(formula = Apps ~ . - Top25perc - Books - Personal - Terminal -
## S.F.Ratio - perc.alumni, data = train2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3532.8 -333.8 -41.0 221.8 4790.5
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.007e+03 2.305e+02 -4.372 1.49e-05 ***
## PrivateNot Private 4.116e+02 1.271e+02 3.239 0.001277 **
## Accept 1.397e+00 4.710e-02 29.659 < 2e-16 ***
## Enroll -6.402e-01 1.833e-01 -3.493 0.000519 ***
## Top10perc 2.835e+01 3.129e+00 9.060 < 2e-16 ***
## F.Undergrad 8.344e-02 2.820e-02 2.959 0.003225 **
## P.Undergrad 6.314e-02 2.782e-02 2.270 0.023635 *
## Outstate -6.524e-02 1.735e-02 -3.760 0.000189 ***
## Room.Board 1.771e-01 4.399e-02 4.025 6.55e-05 ***
## PhD -6.728e+00 2.916e+00 -2.307 0.021446 *
## Expend 4.541e-02 9.677e-03 4.693 3.44e-06 ***
## Grad.Rate 5.228e+00 2.593e+00 2.017 0.044250 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 798.6 on 526 degrees of freedom
## Multiple R-squared: 0.9466, Adjusted R-squared: 0.9455
## F-statistic: 848.3 on 11 and 526 DF, p-value: < 2.2e-16
#Check Assumptions of Regression for Model Traditional_lm_4
#Diagnostic Plots
plot(Traditional_lm_4)
#Test for Skewness and Kurtosis
jarque.test(Traditional_lm_4$residuals)
##
## Jarque-Bera Normality Test
##
## data: Traditional_lm_4$residuals
## JB = 2303.9, p-value < 2.2e-16
## alternative hypothesis: greater
anscombe.test(Traditional_lm_4$residuals)
##
## Anscombe-Glynn kurtosis test
##
## data: Traditional_lm_4$residuals
## kurt = 12.544, z = 10.165, p-value < 2.2e-16
## alternative hypothesis: kurtosis is not equal to 3
#Note: Residuals are not Normally Distributed, so regression assumption is not valid!
#Outlier 562 probably causes the non-normal distribution
#1-3-----------
#remove case 562
train2<- train[-which(rownames(train) %in% c(251,694,71,175,606,562)),]
dim(train2)
## [1] 537 18
Traditional_lm_3 <- lm(Apps ~ ., data= train2)
summary(Traditional_lm_3)
##
## Call:
## lm(formula = Apps ~ ., data = train2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3369.2 -318.2 -35.7 215.0 4906.5
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.309e+03 3.408e+02 -3.839 0.000139 ***
## PrivateNot Private 3.638e+02 1.267e+02 2.871 0.004256 **
## Accept 1.356e+00 4.720e-02 28.727 < 2e-16 ***
## Enroll -5.500e-01 1.813e-01 -3.033 0.002541 **
## Top10perc 2.569e+01 5.157e+00 4.981 8.64e-07 ***
## Top25perc 6.500e-04 3.995e+00 0.000 0.999870
## F.Undergrad 8.456e-02 2.772e-02 3.051 0.002397 **
## P.Undergrad 6.219e-02 2.723e-02 2.284 0.022757 *
## Outstate -5.062e-02 1.744e-02 -2.902 0.003868 **
## Room.Board 1.525e-01 4.429e-02 3.444 0.000620 ***
## Books 3.201e-01 2.243e-01 1.427 0.154239
## Personal -2.998e-02 5.515e-02 -0.544 0.586909
## PhD -5.523e+00 4.136e+00 -1.335 0.182315
## Terminal -7.204e-01 4.425e+00 -0.163 0.870742
## S.F.Ratio 1.137e+01 1.142e+01 0.996 0.319744
## perc.alumni -2.725e+00 3.672e+00 -0.742 0.458464
## Expend 5.223e-02 1.038e-02 5.031 6.75e-07 ***
## Grad.Rate 5.411e+00 2.598e+00 2.083 0.037785 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 775.1 on 519 degrees of freedom
## Multiple R-squared: 0.9493, Adjusted R-squared: 0.9477
## F-statistic: 571.8 on 17 and 519 DF, p-value: < 2.2e-16
Traditional_lm_4 <- lm(Apps ~ .-Top25perc-Books-Personal-PhD-Terminal-S.F.Ratio-perc.alumni,data= train2)
summary(Traditional_lm_4)
##
## Call:
## lm(formula = Apps ~ . - Top25perc - Books - Personal - PhD -
## Terminal - S.F.Ratio - perc.alumni, data = train2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3372.4 -336.1 -19.9 227.8 4917.4
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.242e+03 2.028e+02 -6.127 1.76e-09 ***
## PrivateNot Private 3.002e+02 1.172e+02 2.562 0.010697 *
## Accept 1.366e+00 4.612e-02 29.623 < 2e-16 ***
## Enroll -5.665e-01 1.787e-01 -3.171 0.001609 **
## Top10perc 2.396e+01 2.951e+00 8.121 3.32e-15 ***
## F.Undergrad 8.323e-02 2.740e-02 3.038 0.002502 **
## P.Undergrad 5.776e-02 2.697e-02 2.142 0.032656 *
## Outstate -6.564e-02 1.643e-02 -3.996 7.37e-05 ***
## Room.Board 1.579e-01 4.256e-02 3.710 0.000229 ***
## Expend 4.698e-02 9.413e-03 4.991 8.19e-07 ***
## Grad.Rate 4.770e+00 2.522e+00 1.892 0.059102 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 776.8 on 526 degrees of freedom
## Multiple R-squared: 0.9484, Adjusted R-squared: 0.9474
## F-statistic: 966.9 on 10 and 526 DF, p-value: < 2.2e-16
#Check Assumptions of Regression for Model Traditional_lm_4
#Diagnostic Plots
plot(Traditional_lm_4)
#Test for Skewness and Kurtosis
jarque.test(Traditional_lm_4$residuals)
##
## Jarque-Bera Normality Test
##
## data: Traditional_lm_4$residuals
## JB = 2357.4, p-value < 2.2e-16
## alternative hypothesis: greater
anscombe.test(Traditional_lm_4$residuals)
##
## Anscombe-Glynn kurtosis test
##
## data: Traditional_lm_4$residuals
## kurt = 12.686, z = 10.200, p-value < 2.2e-16
## alternative hypothesis: kurtosis is not equal to 3
#Note: Residuals are not Normally Distributed, so regression assumption is not valid!
#Outlier 446 probably causes the non-normal distribution
#The residual distribution is far from normal.---
#1-4-----------
#remove case 446
train2<- train[-which(rownames(train) %in% c(251,694,71,175,606,562,446)),]
dim(train2)
## [1] 536 18
Traditional_lm_3 <- lm(Apps ~ ., data= train2)
summary(Traditional_lm_3)
##
## Call:
## lm(formula = Apps ~ ., data = train2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3230.4 -329.2 -27.1 232.2 4848.6
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.384e+03 3.317e+02 -4.173 3.52e-05 ***
## PrivateNot Private 3.981e+02 1.233e+02 3.228 0.00133 **
## Accept 1.326e+00 4.620e-02 28.709 < 2e-16 ***
## Enroll -1.814e-01 1.883e-01 -0.963 0.33590
## Top10perc 2.476e+01 5.017e+00 4.936 1.08e-06 ***
## Top25perc -1.165e-01 3.885e+00 -0.030 0.97609
## F.Undergrad 1.234e-02 2.991e-02 0.412 0.68018
## P.Undergrad 8.012e-02 2.667e-02 3.004 0.00279 **
## Outstate -5.595e-02 1.699e-02 -3.294 0.00106 **
## Room.Board 1.647e-01 4.312e-02 3.820 0.00015 ***
## Books 3.708e-01 2.183e-01 1.698 0.09003 .
## Personal -3.881e-02 5.365e-02 -0.724 0.46968
## PhD -3.055e+00 4.046e+00 -0.755 0.45049
## Terminal -2.169e+00 4.311e+00 -0.503 0.61503
## S.F.Ratio 1.145e+01 1.110e+01 1.032 0.30253
## perc.alumni -3.717e+00 3.575e+00 -1.040 0.29894
## Expend 5.458e-02 1.010e-02 5.402 1.00e-07 ***
## Grad.Rate 5.920e+00 2.528e+00 2.342 0.01958 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 753.7 on 518 degrees of freedom
## Multiple R-squared: 0.95, Adjusted R-squared: 0.9483
## F-statistic: 578.5 on 17 and 518 DF, p-value: < 2.2e-16
Traditional_lm_4 <- lm(Apps ~ .-Enroll-Top25perc-F.Undergrad-Personal-PhD-Terminal-S.F.Ratio-perc.alumni,data= train2)
summary(Traditional_lm_4)
##
## Call:
## lm(formula = Apps ~ . - Enroll - Top25perc - F.Undergrad - Personal -
## PhD - Terminal - S.F.Ratio - perc.alumni, data = train2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3149.2 -335.6 -28.2 234.1 4735.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.533e+03 2.144e+02 -7.153 2.86e-12 ***
## PrivateNot Private 3.493e+02 1.120e+02 3.120 0.00191 **
## Accept 1.289e+00 1.897e-02 67.943 < 2e-16 ***
## Top10perc 2.192e+01 2.772e+00 7.909 1.54e-14 ***
## P.Undergrad 6.870e-02 2.419e-02 2.840 0.00468 **
## Outstate -6.483e-02 1.572e-02 -4.123 4.34e-05 ***
## Room.Board 1.728e-01 4.116e-02 4.197 3.17e-05 ***
## Books 3.640e-01 2.116e-01 1.720 0.08596 .
## Expend 4.882e-02 9.145e-03 5.338 1.40e-07 ***
## Grad.Rate 5.476e+00 2.444e+00 2.240 0.02548 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 753.7 on 526 degrees of freedom
## Multiple R-squared: 0.9492, Adjusted R-squared: 0.9483
## F-statistic: 1092 on 9 and 526 DF, p-value: < 2.2e-16
#Check Assumptions of Regression for Model Traditional_lm_4
#Diagnostic Plots
plot(Traditional_lm_4)
#Test for Skewness and Kurtosis
jarque.test(Traditional_lm_4$residuals)
##
## Jarque-Bera Normality Test
##
## data: Traditional_lm_4$residuals
## JB = 2122.7, p-value < 2.2e-16
## alternative hypothesis: greater
anscombe.test(Traditional_lm_4$residuals)
##
## Anscombe-Glynn kurtosis test
##
## data: Traditional_lm_4$residuals
## kurt = 12.208, z = 10.042, p-value < 2.2e-16
## alternative hypothesis: kurtosis is not equal to 3
#Note: Residuals are not Normally Distributed, so regression assumption is not valid!
#The residual distribution is far from normal.
trimmed_train <- train[- which(train$Apps > tukey_1), ]
dim(trimmed_train)
## [1] 493 18
dim(train)
## [1] 543 18
summary(trimmed_train$Apps)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 81 692 1321 2038 2925 7888
(dim(train)-dim(trimmed_train))/dim(train)*100
## [1] 9.208103 0.000000
#About 9% of the training data are outliers
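The cutoff `tukey_1` is computed earlier in the analysis. Assuming it is the usual Tukey upper fence, Q3 + 1.5*IQR, a minimal sketch of that fence on synthetic skewed data:

```r
# Sketch of a Tukey upper fence (assumed construction of tukey_1).
set.seed(1)
x <- c(rexp(100, rate = 1/1000), 50000)     # skewed data plus one extreme value
q <- quantile(x, c(0.25, 0.75))
upper_fence <- unname(q[2] + 1.5 * IQR(x))  # Q3 + 1.5 * IQR
sum(x > upper_fence)                        # how many points would be trimmed
```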
Traditional_lm_trimmed_train <- lm(Apps ~ ., data= trimmed_train)
summary(Traditional_lm_trimmed_train)
##
## Call:
## lm(formula = Apps ~ ., data = trimmed_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1554.63 -261.13 -27.51 187.94 3005.16
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.318e+03 2.321e+02 -5.682 2.33e-08 ***
## PrivateNot Private 3.450e+02 8.795e+01 3.922 0.000101 ***
## Accept 1.427e+00 4.761e-02 29.969 < 2e-16 ***
## Enroll -6.476e-01 1.610e-01 -4.023 6.68e-05 ***
## Top10perc 1.299e+01 3.688e+00 3.523 0.000469 ***
## Top25perc 1.836e+00 2.786e+00 0.659 0.510207
## F.Undergrad 4.931e-02 2.470e-02 1.997 0.046441 *
## P.Undergrad 5.667e-02 2.487e-02 2.279 0.023133 *
## Outstate -2.136e-02 1.201e-02 -1.779 0.075821 .
## Room.Board 6.349e-02 3.023e-02 2.100 0.036214 *
## Books 5.982e-01 1.483e-01 4.032 6.44e-05 ***
## Personal -4.418e-02 3.714e-02 -1.190 0.234806
## PhD 5.138e-01 2.738e+00 0.188 0.851218
## Terminal -1.380e+00 2.904e+00 -0.475 0.634834
## S.F.Ratio 2.100e+01 7.819e+00 2.686 0.007492 **
## perc.alumni 8.709e-01 2.469e+00 0.353 0.724439
## Expend 2.413e-02 7.932e-03 3.042 0.002483 **
## Grad.Rate 3.077e+00 1.750e+00 1.758 0.079368 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 502.1 on 475 degrees of freedom
## Multiple R-squared: 0.9271, Adjusted R-squared: 0.9245
## F-statistic: 355.4 on 17 and 475 DF, p-value: < 2.2e-16
#Delete insignificant variables
Traditional_lm_trimmed_train_2 <- lm(Apps ~ .-Top25perc-Personal-PhD-Terminal-perc.alumni,data=trimmed_train)
summary(Traditional_lm_trimmed_train_2)
##
## Call:
## lm(formula = Apps ~ . - Top25perc - Personal - PhD - Terminal -
## perc.alumni, data = trimmed_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1553.41 -274.16 -21.81 177.43 3032.54
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.391e+03 2.022e+02 -6.880 1.88e-11 ***
## PrivateNot Private 3.362e+02 8.353e+01 4.024 6.64e-05 ***
## Accept 1.422e+00 4.651e-02 30.573 < 2e-16 ***
## Enroll -6.425e-01 1.581e-01 -4.063 5.65e-05 ***
## Top10perc 1.502e+01 2.064e+00 7.276 1.41e-12 ***
## F.Undergrad 4.888e-02 2.447e-02 1.997 0.04636 *
## P.Undergrad 5.027e-02 2.434e-02 2.065 0.03949 *
## Outstate -1.977e-02 1.138e-02 -1.737 0.08306 .
## Room.Board 6.359e-02 2.934e-02 2.168 0.03069 *
## Books 5.676e-01 1.437e-01 3.951 8.94e-05 ***
## S.F.Ratio 2.165e+01 7.682e+00 2.818 0.00503 **
## Expend 2.309e-02 7.772e-03 2.971 0.00312 **
## Grad.Rate 3.406e+00 1.698e+00 2.006 0.04541 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 500.7 on 480 degrees of freedom
## Multiple R-squared: 0.9268, Adjusted R-squared: 0.9249
## F-statistic: 506.2 on 12 and 480 DF, p-value: < 2.2e-16
#Check Assumptions of Regression for Model Traditional_lm_trimmed_train_2
#Normality of residuals
#1.Histogram
hist(Traditional_lm_trimmed_train_2$residuals, probability = TRUE, main = "Histogram of residuals in Model 'Traditional_lm_trimmed_train_2'", breaks = 25)
lines(density(Traditional_lm_trimmed_train_2$residuals), col = "red")
#QQ-plot
qqnorm(Traditional_lm_trimmed_train_2$residuals, main = "QQ Plot of residuals of Model 'Traditional_lm_trimmed_train_2'", pch = 20)
qqline(Traditional_lm_trimmed_train_2$residuals, col = "red")
#Test for Skewness and Kurtosis
jarque.test(Traditional_lm_trimmed_train_2$residuals)
##
## Jarque-Bera Normality Test
##
## data: Traditional_lm_trimmed_train_2$residuals
## JB = 1072.6, p-value < 2.2e-16
## alternative hypothesis: greater
anscombe.test(Traditional_lm_trimmed_train_2$residuals)
##
## Anscombe-Glynn kurtosis test
##
## data: Traditional_lm_trimmed_train_2$residuals
## kurt = 9.6074, z = 8.7112, p-value < 2.2e-16
## alternative hypothesis: kurtosis is not equal to 3
#Note: Residuals are not Normally Distributed, so regression assumption is not valid!
#Cook's distance > 1
which(cooks.distance(Traditional_lm_trimmed_train_2)>1)
## named integer(0)
#Diagnostic Plots
plot(Traditional_lm_trimmed_train_2)
#Check multicollinearity
car::vif(Traditional_lm_trimmed_train_2) #Enroll and F.Undergrad have high VIFs: multicollinearity problem
## Private Accept Enroll Top10perc F.Undergrad P.Undergrad
## 2.483746 6.455307 16.069740 2.024555 11.587167 1.832461
## Outstate Room.Board Books S.F.Ratio Expend Grad.Rate
## 3.952362 1.930230 1.067001 1.768344 2.480500 1.726186
#Conclusion: severe violation of regression assumption
#Bad model!
Test the Model Traditional_lm_trimmed_train_2
#Prediction
pred_Tlm_trimmed <- predict(Traditional_lm_trimmed_train_2, test)
head(pred_Tlm_trimmed,5)
## 8 13 15 18 20
## 2473.9270 1365.9194 475.2189 1000.1578 2897.7983
#Absolute error mean, median, sd, max, min-------
abs_err_Tlm_trimmed <- abs(pred_Tlm_trimmed - test$Apps)
mean(abs_err_Tlm_trimmed)
## [1] 511.9436
median(abs_err_Tlm_trimmed)
## [1] 253.6845
sd(abs_err_Tlm_trimmed)
## [1] 910.2706
range(abs_err_Tlm_trimmed)
## [1] 4.905429 9327.357739
#histogram and boxplot
hist(abs_err_Tlm_trimmed, breaks = 15)
boxplot(abs_err_Tlm_trimmed)
#Actual vs. Predicted
plot(test$Apps, pred_Tlm_trimmed , xlab = "Actual", ylab = "Prediction")
abline(a = 0, b = 1, col = "red", lwd = 2)
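The mean/median/sd/range of the absolute error is computed the same way for every model; a small helper (the function name is illustrative) collapses those four calls into one:

```r
# Helper repeating the absolute-error summary used for each model above.
abs_err_summary <- function(pred, actual) {
  e <- abs(pred - actual)
  c(mean = mean(e), median = median(e), sd = sd(e),
    min = min(e), max = max(e))
}
# Tiny worked example: absolute errors are 20, 10, 20
abs_err_summary(c(100, 250, 400), c(120, 240, 380))
```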
library("MASS")
box_results <- boxcox(Apps ~ ., data = train, lambda = seq(-5, 5, 0.1))
# Collect the profile into a data frame and pick the lambda that maximizes the log-likelihood
box_results <- data.frame(box_results$x, box_results$y)
lambda_1 <- box_results[which(box_results$box_results.y == max(box_results$box_results.y)), 1]
lambda_1
## [1] 0.5
#lambda differs from 0, so the transform is y' = (y^lambda - 1) / lambda
#Transformation
train$transform_Apps <-((train$Apps ^ lambda_1) - 1) / lambda_1
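The forward transform above and the back-transform applied to the predictions later are inverses of each other (for lambda != 0). A quick round-trip check, with helper names chosen for illustration:

```r
# Box-Cox transform and its inverse (lambda != 0 branch), matching the
# formulas used on Apps in this analysis.
boxcox_fwd <- function(y, lambda) (y^lambda - 1) / lambda
boxcox_inv <- function(z, lambda) (lambda * z + 1)^(1 / lambda)

y <- c(81, 1321, 7888)                 # sample Apps-scale values
z <- boxcox_fwd(y, 0.5)
all.equal(boxcox_inv(z, 0.5), y)       # round trip recovers the original scale
```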
lm_transform_Apps_1 <- lm(transform_Apps ~ . - Apps, data = train)
summary(lm_transform_Apps_1)
##
## Call:
## lm(formula = transform_Apps ~ . - Apps, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -102.495 -9.039 0.343 8.187 71.446
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.931e+01 7.347e+00 -3.989 7.56e-05 ***
## PrivateNot Private 2.122e+01 2.735e+00 7.760 4.46e-14 ***
## Accept 1.533e-02 1.003e-03 15.288 < 2e-16 ***
## Enroll 4.950e-03 3.866e-03 1.280 0.201048
## Top10perc 4.699e-01 1.063e-01 4.422 1.19e-05 ***
## Top25perc -1.089e-02 8.476e-02 -0.128 0.897817
## F.Undergrad -1.356e-04 5.951e-04 -0.228 0.819907
## P.Undergrad 1.136e-03 5.888e-04 1.928 0.054345 .
## Outstate 1.981e-04 3.756e-04 0.527 0.598132
## Room.Board 3.517e-03 9.550e-04 3.683 0.000254 ***
## Books 1.182e-02 4.850e-03 2.437 0.015145 *
## Personal 1.390e-03 1.192e-03 1.166 0.243956
## PhD 1.001e-01 8.932e-02 1.120 0.263124
## Terminal -3.558e-02 9.563e-02 -0.372 0.710046
## S.F.Ratio 1.097e+00 2.452e-01 4.472 9.50e-06 ***
## perc.alumni -1.355e-01 7.927e-02 -1.709 0.088061 .
## Expend 1.072e-03 2.194e-04 4.885 1.38e-06 ***
## Grad.Rate 2.485e-01 5.610e-02 4.430 1.15e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 16.8 on 525 degrees of freedom
## Multiple R-squared: 0.9075, Adjusted R-squared: 0.9045
## F-statistic: 302.9 on 17 and 525 DF, p-value: < 2.2e-16
#Delete insignificant variables
lm_transform_Apps_2 <- lm(transform_Apps ~ . - Apps-Enroll-Top25perc-F.Undergrad-Outstate-Personal-PhD-Terminal , data = train)
summary(lm_transform_Apps_2)
##
## Call:
## lm(formula = transform_Apps ~ . - Apps - Enroll - Top25perc -
## F.Undergrad - Outstate - Personal - PhD - Terminal, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -106.362 -9.468 0.176 8.250 74.767
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.314e+01 6.320e+00 -3.662 0.000276 ***
## PrivateNot Private 2.219e+01 2.338e+00 9.495 < 2e-16 ***
## Accept 1.680e-02 4.197e-04 40.031 < 2e-16 ***
## Top10perc 5.105e-01 5.950e-02 8.579 < 2e-16 ***
## P.Undergrad 1.556e-03 5.373e-04 2.895 0.003946 **
## Room.Board 3.452e-03 8.276e-04 4.171 3.54e-05 ***
## Books 1.193e-02 4.683e-03 2.548 0.011115 *
## S.F.Ratio 1.087e+00 2.429e-01 4.474 9.41e-06 ***
## perc.alumni -1.155e-01 7.526e-02 -1.535 0.125476
## Expend 1.128e-03 2.076e-04 5.436 8.32e-08 ***
## Grad.Rate 2.421e-01 5.511e-02 4.394 1.35e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 16.79 on 532 degrees of freedom
## Multiple R-squared: 0.9064, Adjusted R-squared: 0.9046
## F-statistic: 515.2 on 10 and 532 DF, p-value: < 2.2e-16
#Delete insignificant variables
lm_transform_Apps_3 <- lm(transform_Apps ~ . - Apps-Enroll-Top25perc-F.Undergrad-Outstate-Personal-PhD-Terminal-perc.alumni , data = train)
summary(lm_transform_Apps_3)
##
## Call:
## lm(formula = transform_Apps ~ . - Apps - Enroll - Top25perc -
## F.Undergrad - Outstate - Personal - PhD - Terminal - perc.alumni,
## data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -107.515 -9.250 0.223 8.351 74.224
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.503e+01 6.207e+00 -4.032 6.33e-05 ***
## PrivateNot Private 2.268e+01 2.319e+00 9.777 < 2e-16 ***
## Accept 1.689e-02 4.161e-04 40.598 < 2e-16 ***
## Top10perc 4.922e-01 5.837e-02 8.432 3.20e-16 ***
## P.Undergrad 1.579e-03 5.378e-04 2.937 0.00346 **
## Room.Board 3.546e-03 8.263e-04 4.291 2.11e-05 ***
## Books 1.218e-02 4.687e-03 2.599 0.00962 **
## S.F.Ratio 1.130e+00 2.416e-01 4.678 3.67e-06 ***
## Expend 1.100e-03 2.070e-04 5.314 1.58e-07 ***
## Grad.Rate 2.198e-01 5.322e-02 4.130 4.22e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 16.81 on 533 degrees of freedom
## Multiple R-squared: 0.906, Adjusted R-squared: 0.9044
## F-statistic: 570.7 on 9 and 533 DF, p-value: < 2.2e-16
#Check Assumptions of Regression----
#Normality of residuals
#1.Histogram
hist(lm_transform_Apps_3$residuals, probability = TRUE, main = "Histogram of residuals in Model 'lm_transform_Apps_3'", breaks = 25)
lines(density(lm_transform_Apps_3$residuals), col = "red")
#2.QQ-plot
qqnorm(lm_transform_Apps_3$residuals, main = "QQ Plot of residuals of Model 'lm_transform_Apps_3'")
qqline(lm_transform_Apps_3$residuals, col = "red")
#improve Model
#3.Test for Skewness and Kurtosis
jarque.test(lm_transform_Apps_3$residuals)
##
## Jarque-Bera Normality Test
##
## data: lm_transform_Apps_3$residuals
## JB = 601.94, p-value < 2.2e-16
## alternative hypothesis: greater
anscombe.test(lm_transform_Apps_3$residuals)
##
## Anscombe-Glynn kurtosis test
##
## data: lm_transform_Apps_3$residuals
## kurt = 8.1514, z = 8.3056, p-value < 2.2e-16
## alternative hypothesis: kurtosis is not equal to 3
plot(lm_transform_Apps_3)
car::vif(lm_transform_Apps_3)
## Private Accept Top10perc P.Undergrad Room.Board Books
## 2.091806 1.773306 2.034090 1.427575 1.555793 1.045636
## S.F.Ratio Expend Grad.Rate
## 1.804452 2.449353 1.642820
#no multicollinearity problem.
#Note: Residuals are not Normally Distributed, so regression assumption is not valid!
Test the Model lm_transform_Apps_3
pred_lm_transform <- predict(lm_transform_Apps_3, test)
pred_lm_transform<-((lambda_1*pred_lm_transform)+1)^(1/lambda_1)
head(pred_lm_transform,5)
## 8 13 15 18 20
## 2072.3496 1375.0491 623.1701 804.3528 2728.4601
#Absolute error mean, median, sd, max, min-------
abs_err_lm_transform <- abs(pred_lm_transform - test$Apps)
mean(abs_err_lm_transform)
## [1] 711.1906
median(abs_err_lm_transform)
## [1] 370.7594
sd(abs_err_lm_transform)
## [1] 1116.981
range(abs_err_lm_transform)
## [1] 0.5630465 7773.2584894
#histogram and boxplot
hist(abs_err_lm_transform, breaks = 15)
boxplot(abs_err_lm_transform)
#Actual vs. Predicted
plot(test$Apps, pred_lm_transform, xlab = "Actual", ylab = "Prediction")
abline(a = 0, b = 1, col = "red", lwd = 2)
The outliers seem to have a strong effect on the model, so we also remove outliers from the test set.
trimmed_test<-test[-which(test$Apps > tukey_1),]
dim(test)
## [1] 233 18
dim(trimmed_test)
## [1] 214 18
#Model Prediction:lm_transform_Apps_3 with trimmed_test data--------
pred_lm_transform_trimmed_test <- predict(lm_transform_Apps_3, trimmed_test)
pred_lm_transform_trimmed_test<-((lambda_1*pred_lm_transform_trimmed_test)+1)^(1/lambda_1)
head(pred_lm_transform_trimmed_test,5)
## 8 13 15 18 20
## 2072.3496 1375.0491 623.1701 804.3528 2728.4601
#Absolute error mean, median, sd, max, min with trimmed_test data-------
abs_err_lm_transform_trimmed_test <- abs(pred_lm_transform_trimmed_test - trimmed_test$Apps)
mean(abs_err_lm_transform_trimmed_test)
## [1] 498.4068
median(abs_err_lm_transform_trimmed_test)
## [1] 326.481
sd(abs_err_lm_transform_trimmed_test)
## [1] 567.9842
range(abs_err_lm_transform_trimmed_test)
## [1] 0.5630465 4022.9916692
#histogram and boxplot
hist(abs_err_lm_transform_trimmed_test, breaks = 15)
boxplot(abs_err_lm_transform_trimmed_test)
#Actual vs. Predicted
plot(trimmed_test$Apps,pred_lm_transform_trimmed_test, xlab = "Actual", ylab = "Prediction")
abline(a = 0, b = 1, col = "red", lwd = 2)
library("leaps")
dim(train)
## [1] 543 19
#Best Subset Selection---------------------------
bestsub_1 <- regsubsets(transform_Apps~ . - Apps, nvmax = 17, data = train, method = "exhaustive")
summary(bestsub_1)
## Subset selection object
## Call: regsubsets.formula(transform_Apps ~ . - Apps, nvmax = 17, data = train,
## method = "exhaustive")
## 17 Variables (and intercept)
## Forced in Forced out
## PrivateNot Private FALSE FALSE
## Accept FALSE FALSE
## Enroll FALSE FALSE
## Top10perc FALSE FALSE
## Top25perc FALSE FALSE
## F.Undergrad FALSE FALSE
## P.Undergrad FALSE FALSE
## Outstate FALSE FALSE
## Room.Board FALSE FALSE
## Books FALSE FALSE
## Personal FALSE FALSE
## PhD FALSE FALSE
## Terminal FALSE FALSE
## S.F.Ratio FALSE FALSE
## perc.alumni FALSE FALSE
## Expend FALSE FALSE
## Grad.Rate FALSE FALSE
## 1 subsets of each size up to 17
## Selection Algorithm: exhaustive
## PrivateNot Private Accept Enroll Top10perc Top25perc F.Undergrad
## 1 ( 1 ) " " "*" " " " " " " " "
## 2 ( 1 ) " " "*" " " "*" " " " "
## 3 ( 1 ) "*" "*" " " "*" " " " "
## 4 ( 1 ) "*" "*" " " "*" " " " "
## 5 ( 1 ) "*" "*" " " "*" " " " "
## 6 ( 1 ) "*" "*" " " "*" " " " "
## 7 ( 1 ) "*" "*" " " "*" " " " "
## 8 ( 1 ) "*" "*" " " "*" " " " "
## 9 ( 1 ) "*" "*" " " "*" " " " "
## 10 ( 1 ) "*" "*" " " "*" " " " "
## 11 ( 1 ) "*" "*" "*" "*" " " " "
## 12 ( 1 ) "*" "*" "*" "*" " " " "
## 13 ( 1 ) "*" "*" "*" "*" " " " "
## 14 ( 1 ) "*" "*" "*" "*" " " " "
## 15 ( 1 ) "*" "*" "*" "*" " " " "
## 16 ( 1 ) "*" "*" "*" "*" " " "*"
## 17 ( 1 ) "*" "*" "*" "*" "*" "*"
## P.Undergrad Outstate Room.Board Books Personal PhD Terminal S.F.Ratio
## 1 ( 1 ) " " " " " " " " " " " " " " " "
## 2 ( 1 ) " " " " " " " " " " " " " " " "
## 3 ( 1 ) " " " " " " " " " " " " " " " "
## 4 ( 1 ) " " " " "*" " " " " " " " " " "
## 5 ( 1 ) " " " " "*" " " " " " " " " " "
## 6 ( 1 ) " " " " "*" " " " " " " " " "*"
## 7 ( 1 ) " " " " "*" " " " " " " " " "*"
## 8 ( 1 ) "*" " " "*" " " " " " " " " "*"
## 9 ( 1 ) "*" " " "*" "*" " " " " " " "*"
## 10 ( 1 ) "*" " " "*" "*" " " " " " " "*"
## 11 ( 1 ) "*" " " "*" "*" " " " " " " "*"
## 12 ( 1 ) "*" " " "*" "*" " " "*" " " "*"
## 13 ( 1 ) "*" " " "*" "*" "*" "*" " " "*"
## 14 ( 1 ) "*" "*" "*" "*" "*" "*" " " "*"
## 15 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*"
## 16 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*"
## 17 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*"
## perc.alumni Expend Grad.Rate
## 1 ( 1 ) " " " " " "
## 2 ( 1 ) " " " " " "
## 3 ( 1 ) " " " " " "
## 4 ( 1 ) " " " " " "
## 5 ( 1 ) " " "*" " "
## 6 ( 1 ) " " "*" " "
## 7 ( 1 ) " " "*" "*"
## 8 ( 1 ) " " "*" "*"
## 9 ( 1 ) " " "*" "*"
## 10 ( 1 ) "*" "*" "*"
## 11 ( 1 ) "*" "*" "*"
## 12 ( 1 ) "*" "*" "*"
## 13 ( 1 ) "*" "*" "*"
## 14 ( 1 ) "*" "*" "*"
## 15 ( 1 ) "*" "*" "*"
## 16 ( 1 ) "*" "*" "*"
## 17 ( 1 ) "*" "*" "*"
#Model Selection
#1.Adjusted R-squared
summary(bestsub_1)$rsq
## [1] 0.8317380 0.8699219 0.8860704 0.8946646 0.8971094 0.9010831 0.9031614
## [8] 0.9047911 0.9059823 0.9063966 0.9068195 0.9071595 0.9073828 0.9074286
## [15] 0.9074571 0.9074673 0.9074702
summary(bestsub_1)$adjr2
## [1] 0.8314270 0.8694402 0.8854363 0.8938815 0.8961514 0.8999758 0.9018944
## [8] 0.9033647 0.9043947 0.9046372 0.9048892 0.9050575 0.9051067 0.9049741
## [15] 0.9048230 0.9046526 0.9044740
#Plot Adjusted R-squared
plot(summary(bestsub_1)$adjr2,
type = "b",
xlab = "# of Variables",
ylab = "AdjR2",
xaxt = 'n',
xlim = c(1, 17)); grid()
axis(1, at = 1:17, labels = 1:17)
points(which.max(summary(bestsub_1)$adjr2),
summary(bestsub_1)$adjr2[which.max(summary(bestsub_1)$adjr2)],
col = "red", cex = 2, pch = 20)
#Considering Adjusted R-squared, the best model has 13 variables.
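The formula for the selected model is retyped by hand for `lm()` below. It can also be built programmatically from the `regsubsets` fit; a sketch on the built-in mtcars data (note that for factor predictors the coefficient names, e.g. "PrivateNot Private", differ from the column names and would need cleanup before `reformulate`):

```r
library(leaps)

# Refit the size chosen by adjusted R^2 as an lm, building the formula
# from the regsubsets coefficients instead of retyping it.
bs <- regsubsets(mpg ~ ., data = mtcars, nvmax = 10)
k  <- which.max(summary(bs)$adjr2)      # subset size with best adjusted R^2
vars <- names(coef(bs, k))[-1]          # selected predictors, intercept dropped
f   <- reformulate(vars, response = "mpg")
fit <- lm(f, data = mtcars)
```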
#Coefficients of the best model---
coef(bestsub_1, 13) #Model w/ 13 variables
## (Intercept) PrivateNot Private Accept Enroll
## -29.566390023 20.444832553 0.015437816 0.004064957
## Top10perc P.Undergrad Room.Board Books
## 0.463501309 0.001106348 0.003623007 0.011364203
## Personal PhD S.F.Ratio perc.alumni
## 0.001334815 0.081410586 1.083955541 -0.131028448
## Expend Grad.Rate
## 0.001101581 0.252814547
bestsub_adjr2 <- lm(transform_Apps ~Private+Accept+ Enroll+Top10perc+ P.Undergrad+Room.Board+
Books+Personal+PhD+S.F.Ratio+perc.alumni+Expend+Grad.Rate, data = train)
summary(bestsub_adjr2)
##
## Call:
## lm(formula = transform_Apps ~ Private + Accept + Enroll + Top10perc +
## P.Undergrad + Room.Board + Books + Personal + PhD + S.F.Ratio +
## perc.alumni + Expend + Grad.Rate, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -102.266 -9.208 0.192 8.372 71.481
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.957e+01 6.996e+00 -4.226 2.80e-05 ***
## PrivateNot Private 2.044e+01 2.461e+00 8.308 8.23e-16 ***
## Accept 1.544e-02 9.681e-04 15.946 < 2e-16 ***
## Enroll 4.065e-03 2.717e-03 1.496 0.1352
## Top10perc 4.635e-01 6.400e-02 7.242 1.57e-12 ***
## P.Undergrad 1.106e-03 5.733e-04 1.930 0.0542 .
## Room.Board 3.623e-03 8.812e-04 4.112 4.56e-05 ***
## Books 1.136e-02 4.762e-03 2.387 0.0174 *
## Personal 1.335e-03 1.182e-03 1.129 0.2593
## PhD 8.141e-02 5.981e-02 1.361 0.1741
## S.F.Ratio 1.084e+00 2.432e-01 4.457 1.01e-05 ***
## perc.alumni -1.310e-01 7.681e-02 -1.706 0.0886 .
## Expend 1.102e-03 2.088e-04 5.275 1.94e-07 ***
## Grad.Rate 2.528e-01 5.539e-02 4.564 6.25e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 16.75 on 529 degrees of freedom
## Multiple R-squared: 0.9074, Adjusted R-squared: 0.9051
## F-statistic: 398.7 on 13 and 529 DF, p-value: < 2.2e-16
Test the Model selected by Adjusted R-squared
#Prediction
pred_bestsub_adj2 <- predict(bestsub_adjr2, test)
head(pred_bestsub_adj2,5)
## 8 13 15 18 20
## 88.24948 71.46674 46.18745 54.67089 102.34381
pred_bestsub_adj2 <-((lambda_1*pred_bestsub_adj2)+1)^(1/lambda_1)
head(pred_bestsub_adj2,5)
## 8 13 15 18 20
## 2036.2422 1349.3405 580.5075 802.8974 2721.9076
#Absolute error mean, median, sd, max, min-------
abs_err_bestsub_adj2 <- abs(pred_bestsub_adj2 - test$Apps)
mean(abs_err_bestsub_adj2)
## [1] 722.8981
median(abs_err_bestsub_adj2)
## [1] 378.1958
sd(abs_err_bestsub_adj2)
## [1] 1173.394
range(abs_err_bestsub_adj2)
## [1] 2.969436 7858.989501
#histogram and boxplot
hist(abs_err_bestsub_adj2, breaks = 15)
boxplot(abs_err_bestsub_adj2)
#Actual vs. Predicted
plot(test$Apps, pred_bestsub_adj2, xlab = "Actual", ylab = "Prediction")
abline(a = 0, b = 1, col = "red", lwd = 2)
#Plot Cp
plot(summary(bestsub_1)$cp,
type = "b",
xlab = "# of Variables",
ylab = "Cp",
xaxt = 'n',
xlim = c(1, 17)); grid()
axis(1, at = 1:17, labels = 1:17)
points(which.min(summary(bestsub_1)$cp),
summary(bestsub_1)$cp[which.min(summary(bestsub_1)$cp)],
col = "red", cex = 2, pch = 20)
#Considering Cp, the best model has 11 variables.
#Coefficients of the best model---
coef(bestsub_1, 11) #Model w/ 11 variables
## (Intercept) PrivateNot Private Accept Enroll
## -24.293342083 21.502050003 0.015453496 0.004206017
## Top10perc P.Undergrad Room.Board Books
## 0.489014606 0.001277759 0.003781491 0.011561276
## S.F.Ratio perc.alumni Expend Grad.Rate
## 1.079085351 -0.127580465 0.001143855 0.251775079
bestsub_CP <- lm(transform_Apps ~ Private + Accept + Enroll + Top10perc + P.Undergrad + Room.Board +
                   Books + S.F.Ratio + perc.alumni + Expend + Grad.Rate, data = train)
summary(bestsub_CP)
##
## Call:
## lm(formula = transform_Apps ~ Private + Accept + Enroll + Top10perc +
## P.Undergrad + Room.Board + Books + S.F.Ratio + perc.alumni +
## Expend + Grad.Rate, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -103.742 -9.390 0.207 8.257 71.342
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.429e+01 6.355e+00 -3.823 0.000148 ***
## PrivateNot Private 2.150e+01 2.377e+00 9.047 < 2e-16 ***
## Accept 1.545e-02 9.650e-04 16.014 < 2e-16 ***
## Enroll 4.206e-03 2.709e-03 1.552 0.121163
## Top10perc 4.890e-01 6.101e-02 8.016 7.01e-15 ***
## P.Undergrad 1.278e-03 5.657e-04 2.259 0.024297 *
## Room.Board 3.782e-03 8.533e-04 4.432 1.14e-05 ***
## Books 1.156e-02 4.683e-03 2.469 0.013879 *
## S.F.Ratio 1.079e+00 2.427e-01 4.447 1.06e-05 ***
## perc.alumni -1.276e-01 7.557e-02 -1.688 0.091932 .
## Expend 1.144e-03 2.075e-04 5.511 5.56e-08 ***
## Grad.Rate 2.518e-01 5.539e-02 4.546 6.79e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 16.77 on 531 degrees of freedom
## Multiple R-squared: 0.9068, Adjusted R-squared: 0.9049
## F-statistic: 469.8 on 11 and 531 DF, p-value: < 2.2e-16
Test the Model selected by Cp
#Prediction
pred_bestsub_CP <- predict(bestsub_CP, test)
head(pred_bestsub_CP,5)
## 8 13 15 18 20
## 87.51193 72.79531 47.11327 56.10697 102.87272
pred_bestsub_CP <- ((lambda_1 * pred_bestsub_CP) + 1) ^ (1 / lambda_1)
head(pred_bestsub_CP,5)
## 8 13 15 18 20
## 2003.0964 1398.5845 603.0283 844.1050 2749.5720
#Absolute error mean, median, sd, max, min-------
abs_err_bestsub_CP <- abs(pred_bestsub_CP - test$Apps)
mean(abs_err_bestsub_CP)
## [1] 724.3431
median(abs_err_bestsub_CP)
## [1] 363.9783
sd(abs_err_bestsub_CP)
## [1] 1174.198
range(abs_err_bestsub_CP)
## [1] 0.9072678 7946.2796208
#histogram and boxplot
hist(abs_err_bestsub_CP, breaks = 15)
boxplot(abs_err_bestsub_CP)
#Actual vs. Predicted
plot(test$Apps, pred_bestsub_CP, xlab = "Actual", ylab = "Prediction")
abline(a = 0, b = 1, col = "red", lwd = 2)
#Plot BIC
plot(summary(bestsub_1)$bic,
type = "b",
xlab = "# of Variables",
ylab = "BIC",
xaxt = 'n',
xlim = c(1, 17)); grid()
axis(1, at = 1:17, labels = 1:17)
points(which.min(summary(bestsub_1)$bic),
summary(bestsub_1)$bic[which.min(summary(bestsub_1)$bic)],
col = "red", cex = 2, pch = 20)
#Considering BIC, the best model has 9 variables.
coef(bestsub_1, 9) #Model w/ 9 variables
## (Intercept) PrivateNot Private Accept Top10perc
## -25.027568709 22.676676201 0.016893239 0.492189120
## P.Undergrad Room.Board Books S.F.Ratio
## 0.001579532 0.003545740 0.012178874 1.130147449
## Expend Grad.Rate
## 0.001100015 0.219789893
bestsub_BIC <- lm(transform_Apps ~ Private + Accept + Top10perc + P.Undergrad + Room.Board +
                    Books + S.F.Ratio + Expend + Grad.Rate, data = train)
summary(bestsub_BIC)
##
## Call:
## lm(formula = transform_Apps ~ Private + Accept + Top10perc +
## P.Undergrad + Room.Board + Books + S.F.Ratio + Expend + Grad.Rate,
## data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -107.515 -9.250 0.223 8.351 74.224
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.503e+01 6.207e+00 -4.032 6.33e-05 ***
## PrivateNot Private 2.268e+01 2.319e+00 9.777 < 2e-16 ***
## Accept 1.689e-02 4.161e-04 40.598 < 2e-16 ***
## Top10perc 4.922e-01 5.837e-02 8.432 3.20e-16 ***
## P.Undergrad 1.579e-03 5.378e-04 2.937 0.00346 **
## Room.Board 3.546e-03 8.263e-04 4.291 2.11e-05 ***
## Books 1.218e-02 4.687e-03 2.599 0.00962 **
## S.F.Ratio 1.130e+00 2.416e-01 4.678 3.67e-06 ***
## Expend 1.100e-03 2.070e-04 5.314 1.58e-07 ***
## Grad.Rate 2.198e-01 5.322e-02 4.130 4.22e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 16.81 on 533 degrees of freedom
## Multiple R-squared: 0.906, Adjusted R-squared: 0.9044
## F-statistic: 570.7 on 9 and 533 DF, p-value: < 2.2e-16
Test the Model selected by BIC
#Prediction
pred_bestsub_BIC <- predict(bestsub_BIC, test)
head(pred_bestsub_BIC,5)
## 8 13 15 18 20
## 89.04613 72.16331 47.92675 54.72223 102.46933
pred_bestsub_BIC <- ((lambda_1 * pred_bestsub_BIC) + 1) ^ (1 / lambda_1)
head(pred_bestsub_BIC,5)
## 8 13 15 18 20
## 2072.3496 1375.0491 623.1701 804.3528 2728.4601
#Absolute error mean, median, sd, max, min-------
abs_err_bestsub_BIC <- abs(pred_bestsub_BIC - test$Apps)
mean(abs_err_bestsub_BIC)
## [1] 711.1906
median(abs_err_bestsub_BIC)
## [1] 370.7594
sd(abs_err_bestsub_BIC)
## [1] 1116.981
range(abs_err_bestsub_BIC)
## [1] 0.5630465 7773.2584894
#histogram and boxplot
hist(abs_err_bestsub_BIC, breaks = 15)
boxplot(abs_err_bestsub_BIC)
#Actual vs. Predicted
plot(test$Apps, pred_bestsub_BIC, xlab = "Actual", ylab = "Prediction")
abline(a = 0, b = 1, col = "red", lwd = 2)
# Forward Stepwise Selection----------
fwd_1 <- regsubsets(transform_Apps ~ . - Apps, nvmax = 17, data = train, method = "forward")
summary(fwd_1)
## Subset selection object
## Call: regsubsets.formula(transform_Apps ~ . - Apps, nvmax = 17, data = train,
## method = "forward")
## 17 Variables (and intercept)
## Forced in Forced out
## PrivateNot Private FALSE FALSE
## Accept FALSE FALSE
## Enroll FALSE FALSE
## Top10perc FALSE FALSE
## Top25perc FALSE FALSE
## F.Undergrad FALSE FALSE
## P.Undergrad FALSE FALSE
## Outstate FALSE FALSE
## Room.Board FALSE FALSE
## Books FALSE FALSE
## Personal FALSE FALSE
## PhD FALSE FALSE
## Terminal FALSE FALSE
## S.F.Ratio FALSE FALSE
## perc.alumni FALSE FALSE
## Expend FALSE FALSE
## Grad.Rate FALSE FALSE
## 1 subsets of each size up to 17
## Selection Algorithm: forward
## PrivateNot Private Accept Enroll Top10perc Top25perc F.Undergrad
## 1 ( 1 ) " " "*" " " " " " " " "
## 2 ( 1 ) " " "*" " " "*" " " " "
## 3 ( 1 ) "*" "*" " " "*" " " " "
## 4 ( 1 ) "*" "*" " " "*" " " " "
## 5 ( 1 ) "*" "*" " " "*" " " " "
## 6 ( 1 ) "*" "*" " " "*" " " " "
## 7 ( 1 ) "*" "*" " " "*" " " " "
## 8 ( 1 ) "*" "*" " " "*" " " " "
## 9 ( 1 ) "*" "*" " " "*" " " " "
## 10 ( 1 ) "*" "*" " " "*" " " " "
## 11 ( 1 ) "*" "*" "*" "*" " " " "
## 12 ( 1 ) "*" "*" "*" "*" " " " "
## 13 ( 1 ) "*" "*" "*" "*" " " " "
## 14 ( 1 ) "*" "*" "*" "*" " " " "
## 15 ( 1 ) "*" "*" "*" "*" " " " "
## 16 ( 1 ) "*" "*" "*" "*" " " "*"
## 17 ( 1 ) "*" "*" "*" "*" "*" "*"
## P.Undergrad Outstate Room.Board Books Personal PhD Terminal S.F.Ratio
## 1 ( 1 ) " " " " " " " " " " " " " " " "
## 2 ( 1 ) " " " " " " " " " " " " " " " "
## 3 ( 1 ) " " " " " " " " " " " " " " " "
## 4 ( 1 ) " " " " "*" " " " " " " " " " "
## 5 ( 1 ) " " " " "*" " " " " " " " " " "
## 6 ( 1 ) " " " " "*" " " " " " " " " "*"
## 7 ( 1 ) " " " " "*" " " " " " " " " "*"
## 8 ( 1 ) "*" " " "*" " " " " " " " " "*"
## 9 ( 1 ) "*" " " "*" "*" " " " " " " "*"
## 10 ( 1 ) "*" " " "*" "*" " " " " " " "*"
## 11 ( 1 ) "*" " " "*" "*" " " " " " " "*"
## 12 ( 1 ) "*" " " "*" "*" " " "*" " " "*"
## 13 ( 1 ) "*" " " "*" "*" "*" "*" " " "*"
## 14 ( 1 ) "*" "*" "*" "*" "*" "*" " " "*"
## 15 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*"
## 16 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*"
## 17 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*"
## perc.alumni Expend Grad.Rate
## 1 ( 1 ) " " " " " "
## 2 ( 1 ) " " " " " "
## 3 ( 1 ) " " " " " "
## 4 ( 1 ) " " " " " "
## 5 ( 1 ) " " "*" " "
## 6 ( 1 ) " " "*" " "
## 7 ( 1 ) " " "*" "*"
## 8 ( 1 ) " " "*" "*"
## 9 ( 1 ) " " "*" "*"
## 10 ( 1 ) "*" "*" "*"
## 11 ( 1 ) "*" "*" "*"
## 12 ( 1 ) "*" "*" "*"
## 13 ( 1 ) "*" "*" "*"
## 14 ( 1 ) "*" "*" "*"
## 15 ( 1 ) "*" "*" "*"
## 16 ( 1 ) "*" "*" "*"
## 17 ( 1 ) "*" "*" "*"
which.max(summary(fwd_1)$adjr2)
## [1] 13
which.min(summary(fwd_1)$cp)
## [1] 11
which.min(summary(fwd_1)$bic)
## [1] 9
# Backward Stepwise Selection----------
bwd_1 <- regsubsets(transform_Apps ~ . - Apps, nvmax = 17, data = train, method = "backward")
summary(bwd_1)
## Subset selection object
## Call: regsubsets.formula(transform_Apps ~ . - Apps, nvmax = 17, data = train,
## method = "backward")
## 17 Variables (and intercept)
## Forced in Forced out
## PrivateNot Private FALSE FALSE
## Accept FALSE FALSE
## Enroll FALSE FALSE
## Top10perc FALSE FALSE
## Top25perc FALSE FALSE
## F.Undergrad FALSE FALSE
## P.Undergrad FALSE FALSE
## Outstate FALSE FALSE
## Room.Board FALSE FALSE
## Books FALSE FALSE
## Personal FALSE FALSE
## PhD FALSE FALSE
## Terminal FALSE FALSE
## S.F.Ratio FALSE FALSE
## perc.alumni FALSE FALSE
## Expend FALSE FALSE
## Grad.Rate FALSE FALSE
## 1 subsets of each size up to 17
## Selection Algorithm: backward
## PrivateNot Private Accept Enroll Top10perc Top25perc F.Undergrad
## 1 ( 1 ) " " "*" " " " " " " " "
## 2 ( 1 ) " " "*" " " "*" " " " "
## 3 ( 1 ) "*" "*" " " "*" " " " "
## 4 ( 1 ) "*" "*" " " "*" " " " "
## 5 ( 1 ) "*" "*" " " "*" " " " "
## 6 ( 1 ) "*" "*" " " "*" " " " "
## 7 ( 1 ) "*" "*" " " "*" " " " "
## 8 ( 1 ) "*" "*" " " "*" " " " "
## 9 ( 1 ) "*" "*" " " "*" " " " "
## 10 ( 1 ) "*" "*" " " "*" " " " "
## 11 ( 1 ) "*" "*" "*" "*" " " " "
## 12 ( 1 ) "*" "*" "*" "*" " " " "
## 13 ( 1 ) "*" "*" "*" "*" " " " "
## 14 ( 1 ) "*" "*" "*" "*" " " " "
## 15 ( 1 ) "*" "*" "*" "*" " " " "
## 16 ( 1 ) "*" "*" "*" "*" " " "*"
## 17 ( 1 ) "*" "*" "*" "*" "*" "*"
## P.Undergrad Outstate Room.Board Books Personal PhD Terminal S.F.Ratio
## 1 ( 1 ) " " " " " " " " " " " " " " " "
## 2 ( 1 ) " " " " " " " " " " " " " " " "
## 3 ( 1 ) " " " " " " " " " " " " " " " "
## 4 ( 1 ) " " " " "*" " " " " " " " " " "
## 5 ( 1 ) " " " " "*" " " " " " " " " " "
## 6 ( 1 ) " " " " "*" " " " " " " " " "*"
## 7 ( 1 ) " " " " "*" " " " " " " " " "*"
## 8 ( 1 ) "*" " " "*" " " " " " " " " "*"
## 9 ( 1 ) "*" " " "*" "*" " " " " " " "*"
## 10 ( 1 ) "*" " " "*" "*" " " " " " " "*"
## 11 ( 1 ) "*" " " "*" "*" " " " " " " "*"
## 12 ( 1 ) "*" " " "*" "*" " " "*" " " "*"
## 13 ( 1 ) "*" " " "*" "*" "*" "*" " " "*"
## 14 ( 1 ) "*" "*" "*" "*" "*" "*" " " "*"
## 15 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*"
## 16 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*"
## 17 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*"
## perc.alumni Expend Grad.Rate
## 1 ( 1 ) " " " " " "
## 2 ( 1 ) " " " " " "
## 3 ( 1 ) " " " " " "
## 4 ( 1 ) " " " " " "
## 5 ( 1 ) " " "*" " "
## 6 ( 1 ) " " "*" " "
## 7 ( 1 ) " " "*" "*"
## 8 ( 1 ) " " "*" "*"
## 9 ( 1 ) " " "*" "*"
## 10 ( 1 ) "*" "*" "*"
## 11 ( 1 ) "*" "*" "*"
## 12 ( 1 ) "*" "*" "*"
## 13 ( 1 ) "*" "*" "*"
## 14 ( 1 ) "*" "*" "*"
## 15 ( 1 ) "*" "*" "*"
## 16 ( 1 ) "*" "*" "*"
## 17 ( 1 ) "*" "*" "*"
which.max(summary(bwd_1)$adjr2)
## [1] 13
which.min(summary(bwd_1)$cp)
## [1] 11
which.min(summary(bwd_1)$bic)
## [1] 9
which.max(summary(bestsub_1)$adjr2)
## [1] 13
which.min(summary(bestsub_1)$cp)
## [1] 11
which.min(summary(bestsub_1)$bic)
## [1] 9
coef(bestsub_1, 9)
## (Intercept) PrivateNot Private Accept Top10perc
## -25.027568709 22.676676201 0.016893239 0.492189120
## P.Undergrad Room.Board Books S.F.Ratio
## 0.001579532 0.003545740 0.012178874 1.130147449
## Expend Grad.Rate
## 0.001100015 0.219789893
coef(fwd_1, 9)
## (Intercept) PrivateNot Private Accept Top10perc
## -25.027568709 22.676676201 0.016893239 0.492189120
## P.Undergrad Room.Board Books S.F.Ratio
## 0.001579532 0.003545740 0.012178874 1.130147449
## Expend Grad.Rate
## 0.001100015 0.219789893
coef(bwd_1, 9)
## (Intercept) PrivateNot Private Accept Top10perc
## -25.027568709 22.676676201 0.016893239 0.492189120
## P.Undergrad Room.Board Books S.F.Ratio
## 0.001579532 0.003545740 0.012178874 1.130147449
## Expend Grad.Rate
## 0.001100015 0.219789893
#Best subset, forward, and backward selection all choose the same 9-variable model!----
k <- 10
set.seed(123)
folds <- sample(1:k, nrow(train), rep = TRUE)
cv_errors <- matrix(NA, k, 17, dimnames = list(NULL, paste(1:17)))
#Prediction function (regsubsets objects have no predict() method, so build one)
predict_regsubsets <- function(object, newdata, id) {
  reg_formula <- as.formula(object$call[[2]])
  mat <- model.matrix(reg_formula, newdata)
  coef_i <- coef(object, id = id)
  mat[, names(coef_i)] %*% coef_i
}
#K-fold Cross Validation
set.seed(1234)
for(i in 1:k){
  best_fit <- regsubsets(transform_Apps ~ . - Apps, data = train[folds != i, ], nvmax = 17, method = "exhaustive")
  for(j in 1:17){
    pred <- predict_regsubsets(best_fit, newdata = train[folds == i, ], id = j)
    cv_errors[i, j] <- mean((train$transform_Apps[folds == i] - pred) ^ 2)
  }
}
head(cv_errors,5)
## 1 2 3 4 5 6 7 8
## [1,] 382.1926 327.9241 360.3289 311.8102 317.9707 290.6581 307.9539 309.1378
## [2,] 505.1825 369.3586 337.0397 306.5241 392.4758 321.7692 326.4828 322.4399
## [3,] 365.9648 340.7436 247.2648 221.8067 227.3207 208.1610 187.9396 181.0177
## [4,] 358.9736 275.9457 218.6549 206.6556 203.4075 189.5014 185.5489 187.5722
## [5,] 953.2446 739.2839 599.8729 556.2435 583.9070 567.5726 542.7559 542.6968
## 9 10 11 12 13 14 15 16
## [1,] 293.2583 306.2096 311.9493 322.8454 319.1472 309.7009 310.1036 310.0975
## [2,] 315.0029 313.4088 307.4334 306.1009 305.0323 306.2159 309.3954 308.3572
## [3,] 168.0497 166.1811 163.4508 162.3656 161.1758 163.4316 163.4525 163.2104
## [4,] 194.9740 204.4906 205.5274 203.9576 206.6665 203.1115 203.7468 203.1323
## [5,] 536.8341 537.0849 528.9368 524.8481 517.4240 519.5601 520.1017 520.4460
## 17
## [1,] 310.1188
## [2,] 309.0627
## [3,] 163.1106
## [4,] 203.0819
## [5,] 519.6718
mean_cv_errors <- apply(cv_errors, 2, mean)
mean_cv_errors
## 1 2 3 4 5 6 7 8
## 507.0688 399.3343 349.9750 324.6412 337.9616 318.7059 305.2901 307.7448
## 9 10 11 12 13 14 15 16
## 301.3743 306.5500 304.4494 303.7560 303.1106 302.6839 302.8821 302.9425
## 17
## 302.9120
plot(mean_cv_errors, type = "b")
which.min(mean_cv_errors)
## 9
## 9
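An alternative to picking the size with the minimum CV error is the one-standard-error rule: choose the smallest model whose mean CV error lies within one standard error of the minimum. The rule itself is standard; the helper below and its toy inputs are illustrative only (the real per-size standard errors were not computed above):

```r
#Hypothetical helper implementing the one-standard-error rule.
one_se_size <- function(cv_mean, cv_se) {
  best <- which.min(cv_mean)
  min(which(cv_mean <= cv_mean[best] + cv_se[best]))
}

#Toy numbers loosely patterned on the mean CV errors above, with a made-up SE.
cv_mean <- c(507, 399, 350, 325, 338, 319, 305, 308, 301)
cv_se <- rep(8, 9)
one_se_size(cv_mean, cv_se) # 7
```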
#Model with 9 variables
#Coefficients of the best model
coef(bestsub_1, 9)
## (Intercept) PrivateNot Private Accept Top10perc
## -25.027568709 22.676676201 0.016893239 0.492189120
## P.Undergrad Room.Board Books S.F.Ratio
## 0.001579532 0.003545740 0.012178874 1.130147449
## Expend Grad.Rate
## 0.001100015 0.219789893
bestsub_cv_1 <- lm(transform_Apps ~ Private + Accept + Top10perc + P.Undergrad + Room.Board +
                     Books + S.F.Ratio + Expend + Grad.Rate, data = train)
summary(bestsub_cv_1)
##
## Call:
## lm(formula = transform_Apps ~ Private + Accept + Top10perc +
## P.Undergrad + Room.Board + Books + S.F.Ratio + Expend + Grad.Rate,
## data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -107.515 -9.250 0.223 8.351 74.224
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.503e+01 6.207e+00 -4.032 6.33e-05 ***
## PrivateNot Private 2.268e+01 2.319e+00 9.777 < 2e-16 ***
## Accept 1.689e-02 4.161e-04 40.598 < 2e-16 ***
## Top10perc 4.922e-01 5.837e-02 8.432 3.20e-16 ***
## P.Undergrad 1.579e-03 5.378e-04 2.937 0.00346 **
## Room.Board 3.546e-03 8.263e-04 4.291 2.11e-05 ***
## Books 1.218e-02 4.687e-03 2.599 0.00962 **
## S.F.Ratio 1.130e+00 2.416e-01 4.678 3.67e-06 ***
## Expend 1.100e-03 2.070e-04 5.314 1.58e-07 ***
## Grad.Rate 2.198e-01 5.322e-02 4.130 4.22e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 16.81 on 533 degrees of freedom
## Multiple R-squared: 0.906, Adjusted R-squared: 0.9044
## F-statistic: 570.7 on 9 and 533 DF, p-value: < 2.2e-16
#Normality of residuals
#Test for Skewness and Kurtosis
jarque.test(bestsub_cv_1$residuals)
##
## Jarque-Bera Normality Test
##
## data: bestsub_cv_1$residuals
## JB = 601.94, p-value < 2.2e-16
## alternative hypothesis: greater
anscombe.test(bestsub_cv_1$residuals)
##
## Anscombe-Glynn kurtosis test
##
## data: bestsub_cv_1$residuals
## kurt = 8.1514, z = 8.3056, p-value < 2.2e-16
## alternative hypothesis: kurtosis is not equal to 3
# Reject Assumption of Normality of residuals
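The Jarque-Bera statistic reported above combines sample skewness S and kurtosis K as JB = n/6 * (S^2 + (K - 3)^2 / 4). A base-R sketch of that formula (illustrative re-implementation only; the test above comes from the moments package):

```r
#Illustrative re-implementation of the Jarque-Bera statistic from its formula;
#the actual test above is moments::jarque.test.
jb_stat <- function(x) {
  n <- length(x)
  m <- mean(x)
  s <- sqrt(mean((x - m) ^ 2))
  skew <- mean((x - m) ^ 3) / s ^ 3
  kurt <- mean((x - m) ^ 4) / s ^ 4
  n / 6 * (skew ^ 2 + (kurt - 3) ^ 2 / 4)
}
```

Large values like the JB = 601.94 above mean the residuals' skewness and kurtosis are far from those of a normal distribution, so normality is rejected.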
plot(bestsub_cv_1)
car::vif(bestsub_cv_1)
## Private Accept Top10perc P.Undergrad Room.Board Books
## 2.091806 1.773306 2.034090 1.427575 1.555793 1.045636
## S.F.Ratio Expend Grad.Rate
## 1.804452 2.449353 1.642820
#All VIFs are below 2.5, so there is no multicollinearity problem.
Test the Model bestsub_cv_1
#Prediction
pred_bestsub_cv <- predict(bestsub_cv_1, test)
pred_bestsub_cv <- ((lambda_1*pred_bestsub_cv)+1)^(1/lambda_1)
#Absolute error mean, median, sd, max, min-------
abs_err_bestsub_cv <- abs(pred_bestsub_cv - test$Apps)
mean(abs_err_bestsub_cv)
## [1] 711.1906
median(abs_err_bestsub_cv)
## [1] 370.7594
sd(abs_err_bestsub_cv)
## [1] 1116.981
range(abs_err_bestsub_cv)
## [1] 0.5630465 7773.2584894
#histogram and boxplot
hist(abs_err_bestsub_cv, breaks = 15)
boxplot(abs_err_bestsub_cv)
#Actual vs. Predicted
plot(test$Apps, pred_bestsub_cv, xlab = "Actual", ylab = "Prediction")
abline(a = 0, b = 1, col = "red", lwd = 2)
#Add transform_Apps to the trimmed training set
View(trimmed_train)
trimmed_train$transform_Apps <- ((trimmed_train$Apps ^ lambda_1) - 1) / lambda_1
trimmed_bestsub_1 <- regsubsets(transform_Apps ~ . - Apps, nvmax = 18, data = trimmed_train, method = "exhaustive")
summary(trimmed_bestsub_1)
## Subset selection object
## Call: regsubsets.formula(transform_Apps ~ . - Apps, nvmax = 18, data = trimmed_train,
## method = "exhaustive")
## 17 Variables (and intercept)
## Forced in Forced out
## PrivateNot Private FALSE FALSE
## Accept FALSE FALSE
## Enroll FALSE FALSE
## Top10perc FALSE FALSE
## Top25perc FALSE FALSE
## F.Undergrad FALSE FALSE
## P.Undergrad FALSE FALSE
## Outstate FALSE FALSE
## Room.Board FALSE FALSE
## Books FALSE FALSE
## Personal FALSE FALSE
## PhD FALSE FALSE
## Terminal FALSE FALSE
## S.F.Ratio FALSE FALSE
## perc.alumni FALSE FALSE
## Expend FALSE FALSE
## Grad.Rate FALSE FALSE
## 1 subsets of each size up to 17
## Selection Algorithm: exhaustive
## PrivateNot Private Accept Enroll Top10perc Top25perc F.Undergrad
## 1 ( 1 ) " " "*" " " " " " " " "
## 2 ( 1 ) " " "*" " " "*" " " " "
## 3 ( 1 ) "*" "*" " " "*" " " " "
## 4 ( 1 ) "*" "*" "*" "*" " " " "
## 5 ( 1 ) "*" "*" "*" "*" " " " "
## 6 ( 1 ) "*" "*" "*" "*" " " " "
## 7 ( 1 ) "*" "*" "*" "*" " " " "
## 8 ( 1 ) "*" "*" "*" "*" " " " "
## 9 ( 1 ) "*" "*" "*" "*" " " " "
## 10 ( 1 ) "*" "*" "*" "*" " " " "
## 11 ( 1 ) "*" "*" "*" "*" " " " "
## 12 ( 1 ) "*" "*" "*" "*" "*" " "
## 13 ( 1 ) "*" "*" "*" "*" "*" "*"
## 14 ( 1 ) "*" "*" "*" "*" "*" "*"
## 15 ( 1 ) "*" "*" "*" "*" "*" "*"
## 16 ( 1 ) "*" "*" "*" "*" "*" "*"
## 17 ( 1 ) "*" "*" "*" "*" "*" "*"
## P.Undergrad Outstate Room.Board Books Personal PhD Terminal S.F.Ratio
## 1 ( 1 ) " " " " " " " " " " " " " " " "
## 2 ( 1 ) " " " " " " " " " " " " " " " "
## 3 ( 1 ) " " " " " " " " " " " " " " " "
## 4 ( 1 ) " " " " " " " " " " " " " " " "
## 5 ( 1 ) " " " " " " "*" " " " " " " " "
## 6 ( 1 ) " " " " " " "*" " " " " " " "*"
## 7 ( 1 ) " " " " " " "*" " " "*" " " "*"
## 8 ( 1 ) " " " " " " "*" " " "*" " " "*"
## 9 ( 1 ) "*" " " " " "*" " " "*" " " "*"
## 10 ( 1 ) "*" " " " " "*" " " "*" " " "*"
## 11 ( 1 ) "*" " " "*" "*" " " "*" " " "*"
## 12 ( 1 ) "*" " " "*" "*" " " "*" " " "*"
## 13 ( 1 ) "*" " " "*" "*" " " "*" " " "*"
## 14 ( 1 ) "*" " " "*" "*" "*" "*" " " "*"
## 15 ( 1 ) "*" "*" "*" "*" " " "*" "*" "*"
## 16 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*"
## 17 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*"
## perc.alumni Expend Grad.Rate
## 1 ( 1 ) " " " " " "
## 2 ( 1 ) " " " " " "
## 3 ( 1 ) " " " " " "
## 4 ( 1 ) " " " " " "
## 5 ( 1 ) " " " " " "
## 6 ( 1 ) " " " " " "
## 7 ( 1 ) " " " " " "
## 8 ( 1 ) " " " " "*"
## 9 ( 1 ) " " " " "*"
## 10 ( 1 ) " " "*" "*"
## 11 ( 1 ) " " "*" "*"
## 12 ( 1 ) " " "*" "*"
## 13 ( 1 ) " " "*" "*"
## 14 ( 1 ) " " "*" "*"
## 15 ( 1 ) " " "*" "*"
## 16 ( 1 ) " " "*" "*"
## 17 ( 1 ) "*" "*" "*"
which.max(summary(trimmed_bestsub_1)$adjr2)
## [1] 12
which.min(summary(trimmed_bestsub_1)$cp)
## [1] 12
which.min(summary(trimmed_bestsub_1)$bic)
## [1] 9
#Coefficients of the best model
coef(trimmed_bestsub_1, 12) #Model w/ 12 variables
## (Intercept) PrivateNot Private Accept Enroll
## -8.5625334646 10.5656744263 0.0289251994 -0.0114867392
## Top10perc Top25perc P.Undergrad Room.Board
## 0.2037194524 0.0911240813 0.0011877556 0.0009993557
## Books PhD S.F.Ratio Expend
## 0.0138744140 0.1008030675 0.8026544203 0.0003172184
## Grad.Rate
## 0.1068480341
trimmed_bestsub_2 <- lm(transform_Apps ~ Private + Accept + Enroll + Top10perc + Top25perc + P.Undergrad + Room.Board +
                          Books + PhD + S.F.Ratio + Expend + Grad.Rate, data = trimmed_train)
summary(trimmed_bestsub_2)
##
## Call:
## lm(formula = transform_Apps ~ Private + Accept + Enroll + Top10perc +
## Top25perc + P.Undergrad + Room.Board + Books + PhD + S.F.Ratio +
## Expend + Grad.Rate, data = trimmed_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -47.444 -5.803 -0.051 5.433 46.447
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -8.5625335 4.5995690 -1.862 0.06327 .
## PrivateNot Private 10.5656744 1.6902350 6.251 9.01e-10 ***
## Accept 0.0289252 0.0009644 29.994 < 2e-16 ***
## Enroll -0.0114867 0.0022037 -5.213 2.77e-07 ***
## Top10perc 0.2037195 0.0775878 2.626 0.00892 **
## Top25perc 0.0911241 0.0585006 1.558 0.11997
## P.Undergrad 0.0011878 0.0004865 2.442 0.01498 *
## Room.Board 0.0009994 0.0005949 1.680 0.09363 .
## Books 0.0138744 0.0030863 4.495 8.71e-06 ***
## PhD 0.1008031 0.0396526 2.542 0.01133 *
## S.F.Ratio 0.8026544 0.1637948 4.900 1.31e-06 ***
## Expend 0.0003172 0.0001618 1.960 0.05052 .
## Grad.Rate 0.1068480 0.0360550 2.963 0.00319 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.71 on 480 degrees of freedom
## Multiple R-squared: 0.9207, Adjusted R-squared: 0.9187
## F-statistic: 464.2 on 12 and 480 DF, p-value: < 2.2e-16
Test the Model trimmed_bestsub_2
pred_trimmed_bestsub <- predict(trimmed_bestsub_2, test)
pred_trimmed_bestsub <- ((lambda_1*pred_trimmed_bestsub)+1)^(1/lambda_1)
head(pred_trimmed_bestsub,5)
## 8 13 15 18 20
## 2200.6358 1130.7939 499.0944 818.8541 2592.7988
#Absolute error mean, median, sd, max, min-------
abs_err_trimmed_bestsub <- abs(pred_trimmed_bestsub - test$Apps)
mean(abs_err_trimmed_bestsub)
## [1] 1007.595
median(abs_err_trimmed_bestsub)
## [1] 251.5226
sd(abs_err_trimmed_bestsub)
## [1] 2460.769
range(abs_err_trimmed_bestsub)
## [1] 6.774924e-01 1.983109e+04
IQR(abs_err_trimmed_bestsub)
## [1] 420.2313
#histogram and boxplot
hist(abs_err_trimmed_bestsub, breaks = 15)
boxplot(abs_err_trimmed_bestsub)
#Actual vs. Predicted
plot(test$Apps, pred_trimmed_bestsub, xlab = "Actual", ylab = "Prediction")
abline(a = 0, b = 1, col = "red", lwd = 2)
#Regularization
x <- model.matrix(transform_Apps ~ . - Apps, data = train)[, -1] #remove intercept column
y <- train$transform_Apps
lambda_ridge_grid <- 10 ^ seq(10, -2, length = 100)
head(lambda_ridge_grid,5)
## [1] 10000000000 7564633276 5722367659 4328761281 3274549163
#Apply Ridge Regression
library("glmnet")
ridgereg_1 <- glmnet(x, y, alpha = 0, lambda = lambda_ridge_grid)
dim(coef(ridgereg_1))
## [1] 18 100
#Cross validation to choose the best model
set.seed(1234)
ridge_cv <- cv.glmnet(x, y, alpha = 0, nfolds = 10)
#The mean cross-validated error
ridge_cv$cvm
## [1] 2964.9934 2941.6670 2937.5400 2934.7770 2931.7501 2928.4345 2924.8034
## [8] 2920.8276 2916.4754 2911.7123 2906.5009 2900.8007 2894.5679 2887.7551
## [15] 2880.3112 2872.1815 2863.3067 2853.6235 2843.0642 2831.5564 2819.0234
## [22] 2805.3837 2790.5512 2774.4357 2756.9425 2737.9733 2717.4265 2695.1978
## [29] 2671.1812 2645.2698 2617.3575 2587.3399 2555.1169 2520.5940 2483.6851
## [36] 2444.3150 2402.4222 2357.9619 2310.9092 2261.2625 2209.0465 2154.3160
## [43] 2097.1559 2037.6874 1976.0671 1912.4886 1847.1817 1780.4118 1712.4769
## [50] 1643.7041 1574.4445 1505.0669 1435.9509 1367.4789 1300.0274 1233.9595
## [57] 1169.6160 1107.3082 1047.3118 989.8617 935.1481 883.3151 834.4577
## [64] 788.5903 745.7719 705.9860 669.1542 635.1842 603.9622 575.3564
## [71] 549.2220 525.4061 503.7519 484.1018 466.2854 450.2011 435.6785
## [78] 422.5782 410.7715 400.1381 390.5665 381.9538 374.2049 367.2329
## [85] 360.9577 355.3014 350.2082 345.6137 341.4641 337.7110 334.3092
## [92] 331.2227 328.4146 325.8519 323.5140 321.3863 319.4330 317.6657
## [99] 316.0548 314.6837
#Estimate of standard error of cvm.
ridge_cv$cvsd
## [1] 229.38531 227.84597 227.05103 226.85872 226.64801 226.41716 226.16431
## [8] 225.88740 225.58421 225.25233 224.88912 224.49174 224.05710 223.58188
## [15] 223.06245 222.49493 221.87516 221.19862 220.46051 219.65566 218.77858
## [22] 217.82344 216.78405 215.65388 214.42608 213.09347 211.64860 210.08379
## [29] 208.39114 206.56265 204.59027 202.46599 200.18201 197.73081 195.10535
## [36] 192.29926 189.30701 186.12416 182.74755 179.17561 175.40856 171.44873
## [43] 167.30060 162.97123 158.47039 153.81063 149.00735 144.07882 139.04608
## [50] 133.93273 128.76464 123.56955 118.37662 113.21581 108.11724 103.11051
## [57] 98.22398 93.48405 88.91450 84.53587 80.36499 76.41460 72.69390
## [64] 69.20068 65.94306 62.91484 60.10898 57.51512 55.12121 52.91375
## [71] 50.87852 49.00116 47.26772 45.66516 44.17692 42.80176 41.52693
## [78] 40.34381 39.24591 38.22795 37.28581 36.41629 35.61702 34.88629
## [85] 34.22291 33.62619 33.09508 32.62949 32.22913 31.89372 31.62318
## [92] 31.41678 31.27260 31.19257 31.17202 31.21274 31.31343 31.46393
## [99] 31.67037 31.88659
#value of lambda that gives minimum cvm
ridge_cv$lambda.min
## [1] 4.953654
#Coefficients of regression w/ best_lambda
ridgereg <- glmnet(x, y, alpha = 0, lambda = ridge_cv$lambda.min)
coef(ridgereg)
## 18 x 1 sparse Matrix of class "dgCMatrix"
## s0
## (Intercept) -3.269253e+01
## PrivateNot Private 1.889714e+01
## Accept 1.012133e-02
## Enroll 1.157851e-02
## Top10perc 3.033624e-01
## Top25perc 8.463197e-02
## F.Undergrad 1.115164e-03
## P.Undergrad 6.489259e-04
## Outstate 5.146141e-04
## Room.Board 3.690088e-03
## Books 1.135417e-02
## Personal 9.137186e-04
## PhD 1.209991e-01
## Terminal 2.252635e-03
## S.F.Ratio 9.337835e-01
## perc.alumni -1.989425e-01
## Expend 1.011114e-03
## Grad.Rate 2.652834e-01
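Besides lambda.min, cv.glmnet also reports lambda.1se: the largest (most regularized) lambda whose CV error stays within one standard error of the minimum. The sketch below mirrors that selection rule on toy values; it is an illustration of the logic, not a call into glmnet (helper name and numbers are made up, and glmnet stores lambdas in decreasing order):

```r
#Sketch of how a lambda.1se-style value is selected from CV output.
pick_lambda_1se <- function(lambda, cvm, cvsd) {
  best <- which.min(cvm)
  max(lambda[cvm <= cvm[best] + cvsd[best]])
}

pick_lambda_1se(lambda = c(10, 5, 1, 0.5),
                cvm = c(40, 30, 20, 21),
                cvsd = c(3, 3, 3, 3)) # 1
```

Using ridge_cv$lambda.1se instead of ridge_cv$lambda.min trades a little CV error for a simpler, more heavily shrunk model.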
Test the Model: Ridge Regression
#Create model matrix for test
test$transform_Apps <- ((test$Apps ^ lambda_1) - 1) / lambda_1
x_test <- model.matrix(transform_Apps ~ . - Apps, data = test)[, -1] #remove intercept column
pred_ridgereg <- predict(ridgereg, s = ridge_cv$lambda.min, newx = x_test)
head(pred_ridgereg,5)
## 1
## 8 87.39334
## 13 70.89806
## 15 45.29346
## 18 56.38546
## 20 100.17449
pred_ridgereg <- ((lambda_1*pred_ridgereg)+1)^(1/lambda_1)
head(pred_ridgereg,5)
## 1
## 8 1997.7924
## 13 1328.5318
## 15 559.1679
## 18 852.2155
## 20 2609.9066
#Absolute error mean, median, sd, max, min
abs_err_ridgereg <- abs(pred_ridgereg - test$Apps)
mean(abs_err_ridgereg)
## [1] 797.11
median(abs_err_ridgereg)
## [1] 396.5723
sd(abs_err_ridgereg)
## [1] 1315.391
IQR(abs_err_ridgereg)
## [1] 625.6818
range(abs_err_ridgereg)
## [1] 1.6371 8856.8520
#Actual vs. Predicted
plot(test$Apps, pred_ridgereg, xlab = "Actual", ylab = "Prediction")
abline(a = 0, b = 1, col = "red", lwd = 2)
# Lasso Regression----------
library("glmnet") #already loaded for ridge
lassoreg_1 <- glmnet(x, y, alpha = 1, lambda = lambda_ridge_grid)
dim(coef(lassoreg_1))
## [1] 18 100
#Plot Reg. Coefficients vs. Log Lambda
plot(lassoreg_1, xvar = "lambda")
#Cross validation to choose the best model
set.seed(1234)
lasso_cv <- cv.glmnet(x, y, alpha = 1, nfolds = 10)
#The mean cross-validated error
lasso_cv$cvm
## [1] 2947.9467 2559.7017 2208.6449 1917.5853 1676.3016 1476.3108 1310.5730
## [8] 1173.2461 1059.4824 965.2593 887.2345 822.5246 768.3007 723.2994
## [15] 684.7024 642.8826 601.0385 565.0436 534.6617 509.4137 488.2328
## [22] 469.2745 451.2316 434.9290 421.0045 408.7920 397.7708 387.5685
## [29] 378.0573 370.0615 362.9722 356.7581 350.9709 345.3854 340.1411
## [36] 334.8402 329.8254 325.3377 321.7092 318.7207 316.3989 314.4964
## [43] 312.8172 311.4153 310.1963 309.1967 308.3239 307.5644 306.9472
## [50] 306.4893 306.1187 305.7969 305.4867 305.2694 305.0969 304.9713
## [57] 304.8520 304.7640 304.7023 304.6582 304.6268 304.5971 304.5799
## [64] 304.5978 304.6209 304.6783 304.7182 304.7648
#Estimate of standard error of cvm.
lasso_cv$cvsd
## [1] 234.42691 206.82279 174.79804 150.24636 131.43051 116.92530 105.60617
## [8] 96.62714 89.38338 83.46327 78.59908 74.63178 71.45641 69.08312
## [15] 67.59738 65.67452 62.37739 59.74929 57.61364 55.95526 54.65781
## [22] 53.51365 52.59322 51.39216 50.16230 49.07686 48.19189 47.40230
## [29] 46.80685 46.41385 45.93875 45.51774 45.10412 44.62993 44.22872
## [36] 43.98220 43.88357 43.82384 43.83979 43.86577 43.89339 43.94365
## [43] 44.00276 44.09507 44.13005 44.20173 44.26622 44.34475 44.41912
## [50] 44.49129 44.55797 44.60591 44.59931 44.61239 44.62322 44.63199
## [57] 44.64308 44.65535 44.66612 44.67608 44.68542 44.69704 44.70659
## [64] 44.71878 44.73081 44.76461 44.77174 44.77838
#value of lambda that gives minimum cvm
lasso_cv$lambda.min
## [1] 0.1548372
#Coefficients of regression w/ best_lambda
lassoreg_2 <- glmnet(x, y, alpha = 1, lambda = lasso_cv$lambda.min)
coef(lassoreg_2)
## 18 x 1 sparse Matrix of class "dgCMatrix"
## s0
## (Intercept) -2.655394e+01
## PrivateNot Private 2.037740e+01
## Accept 1.541375e-02
## Enroll 4.277179e-03
## Top10perc 4.565052e-01
## Top25perc .
## F.Undergrad .
## P.Undergrad 1.040365e-03
## Outstate 7.060850e-05
## Room.Board 3.467202e-03
## Books 1.093456e-02
## Personal 1.163522e-03
## PhD 8.006472e-02
## Terminal .
## S.F.Ratio 1.012786e+00
## perc.alumni -1.138125e-01
## Expend 1.042622e-03
## Grad.Rate 2.348864e-01
Test the Model: Lasso Regression
pred_lassoreg <- predict(lassoreg_2, s = lasso_cv$lambda.min, newx = x_test)
head(pred_lassoreg,5)
## 1
## 8 88.50788
## 13 71.26303
## 15 46.45110
## 18 55.12598
## 20 101.94912
pred_lassoreg <- ((lambda_1*pred_lassoreg)+1)^(1/lambda_1)
head(pred_lassoreg,5)
## 1
## 8 2047.9192
## 13 1341.8680
## 15 586.8773
## 18 815.8445
## 20 2701.3551
#Absolute error mean, median, sd, max, min-------
abs_err_lassoreg <- abs(pred_lassoreg - test$Apps)
mean(abs_err_lassoreg)
## [1] 720.7397
median(abs_err_lassoreg)
## [1] 371.5482
sd(abs_err_lassoreg)
## [1] 1173.443
IQR(abs_err_lassoreg)
## [1] 581.7214
range(abs_err_lassoreg)
## [1] 1.616778 7866.567852
#Decision Tree Model Using All Variables
library("rpart")
tree_1 <- rpart(transform_Apps ~ . - Apps, data = train, cp = 0.0001, maxdepth = 17)
#Plot the tree
library("rpart.plot")
prp(tree_1)
#Prune the tree
plotcp(tree_1)
tree_1$cptable[which.min(tree_1$cptable[, "xerror"]), "CP"]
## [1] 0.0001585943
#Prune the tree at the cp with minimum cross-validated error
tree_2 <- prune.rpart(tree_1,
                      cp = tree_1$cptable[which.min(tree_1$cptable[, "xerror"]), "CP"])
#Plot the tree
prp(tree_2)
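The cp used for pruning is the one in the row of the cptable with the smallest cross-validated error. On a toy cptable (hypothetical values, same column names rpart uses) the lookup is:

```r
#Toy cptable (hypothetical values) showing the minimum-xerror cp lookup
#that the pruning step above performs on the real tree.
cptable <- cbind(CP = c(0.10, 0.02, 0.005),
                 nsplit = c(0, 1, 4),
                 xerror = c(1.00, 0.60, 0.55))
cptable[which.min(cptable[, "xerror"]), "CP"] # 0.005
```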
Test the Model: Decision Tree
pred_tree <- predict(tree_2, test)
pred_tree <- ((lambda_1*pred_tree)+1)^(1/lambda_1)
head(pred_tree,5)
## 8 13 15 18 20
## 2490.0515 955.6153 417.6820 1162.6730 2951.7708
#Absolute error mean, median, sd, max, min-------
abs_err_tree <- abs(pred_tree - test$Apps)
mean(abs_err_tree)
## [1] 511.7333
median(abs_err_tree)
## [1] 166.96
sd(abs_err_tree)
## [1] 977.2577
IQR(abs_err_tree)
## [1] 427.0593
range(abs_err_tree)
## [1] 9.915969e-01 1.002958e+04
#Actual vs. Predicted
plot(test$Apps, pred_tree, xlab = "Actual", ylab = "Prediction")
abline(a = 0, b = 1, col = "red", lwd = 2)
# Bagging----------
library("randomForest")
set.seed(1234)
bagging_1 <- randomForest(transform_Apps ~ . - Apps, mtry = ncol(train) - 2, ntree = 500, data = train)
bagging_1
##
## Call:
## randomForest(formula = transform_Apps ~ . - Apps, data = train, mtry = ncol(train) - 2, ntree = 500)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 17
##
## Mean of squared residuals: 143.4716
## % Var explained: 95.14
Test the Model Bagging
#Prediction: M8 Bagging
pred_bagging <- predict(bagging_1, test)
pred_bagging <- ((lambda_1*pred_bagging)+1)^(1/lambda_1)
head(pred_bagging,5)
## 8 13 15 18 20
## 2310.4751 963.6219 393.0022 1142.6620 2539.5328
#Absolute error mean, median, sd, max, min-------
abs_err_bagging <- abs(pred_bagging - test$Apps)
mean(abs_err_bagging)
## [1] 483.0405
median(abs_err_bagging)
## [1] 133.405
sd(abs_err_bagging)
## [1] 905.2469
IQR(abs_err_bagging)
## [1] 483.2991
range(abs_err_bagging)
## [1] 0.6534906 9092.7464852
#Actual vs. Predicted
plot(test$Apps, pred_bagging, xlab = "Actual", ylab = "Prediction")
abline(a = 0, b = 1, col = "red", lwd = 2)
set.seed(1234)
rf_1 <- randomForest(transform_Apps ~ . - Apps, data = train, ntree = 500, importance = TRUE)
rf_1
##
## Call:
## randomForest(formula = transform_Apps ~ . - Apps, data = train, ntree = 500, importance = TRUE)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 5
##
## Mean of squared residuals: 149.2345
## % Var explained: 94.94
importance(rf_1)
## %IncMSE IncNodePurity
## Private 6.826303 54012.429
## Accept 35.357239 603563.626
## Enroll 19.642033 373794.371
## Top10perc 11.473543 27159.186
## Top25perc 12.545436 36526.523
## F.Undergrad 15.744020 275972.687
## P.Undergrad 7.086237 46690.799
## Outstate 14.459781 30033.667
## Room.Board 10.123740 12799.445
## Books 1.396802 5489.845
## Personal 3.730616 6508.770
## PhD 8.138054 40276.026
## Terminal 9.019460 25454.053
## S.F.Ratio 5.812609 8956.428
## perc.alumni 5.129829 6936.498
## Expend 8.403681 24034.705
## Grad.Rate 9.947238 20333.004
varImpPlot(rf_1)
#K-fold Cross-Validation for feature selection
dim(train)
## [1] 543 19
set.seed(12345)
rf_cv <- rfcv(train[, - c(2, 19)],
train$transform_Apps,
cv.fold = 10,
step = 0.75,
mtry = function(p) max(1, floor(sqrt(p))),
recursive = FALSE)
class(rf_cv)
## [1] "list"
str(rf_cv)
## List of 3
## $ n.var : num [1:9] 17 13 10 7 5 4 3 2 1
## $ error.cv : Named num [1:9] 169 164 152 151 164 ...
## ..- attr(*, "names")= chr [1:9] "17" "13" "10" "7" ...
## $ predicted:List of 9
## ..$ 17: num [1:543] 224 146 59 129 154 ...
## ..$ 13: num [1:543] 223.8 144.2 58.7 129 153.4 ...
## ..$ 10: num [1:543] 227.6 147.1 57.4 129.9 148.2 ...
## ..$ 7 : num [1:543] 236.4 151.4 58.3 133.9 149.9 ...
## ..$ 5 : num [1:543] 241.1 153.4 57.9 134.8 146.8 ...
## ..$ 4 : num [1:543] 243.4 151.6 58.1 134.6 150 ...
## ..$ 3 : num [1:543] 236.2 152.6 59.7 140.6 148.8 ...
## ..$ 2 : num [1:543] 249.9 171.8 57.7 145.5 159.1 ...
## ..$ 1 : num [1:543] 264.1 198.6 57.4 129.4 162.6 ...
#Vector of number of variables used at each step
rf_cv$n.var
## [1] 17 13 10 7 5 4 3 2 1
#Corresponding vector of MSEs at each step
rf_cv$error.cv
## 17 13 10 7 5 4 3 2
## 169.0763 164.4601 152.2640 151.0196 164.1861 177.6105 286.1920 288.5219
## 1
## 303.9524
which.min(rf_cv$error.cv)
## 7
## 4
#Remove 10 variables based on Importance of Variables
sort(importance(rf_1)[,1])
## Books Personal perc.alumni S.F.Ratio Private P.Undergrad
## 1.396802 3.730616 5.129829 5.812609 6.826303 7.086237
## PhD Expend Terminal Grad.Rate Room.Board Top10perc
## 8.138054 8.403681 9.019460 9.947238 10.123740 11.473543
## Top25perc Outstate F.Undergrad Enroll Accept
## 12.545436 14.459781 15.744020 19.642033 35.357239
#Regression formula
reg_formula <- as.formula(transform_Apps ~ Room.Board + Top10perc + Top25perc + Outstate + F.Undergrad + Enroll + Accept)
reg_formula
## transform_Apps ~ Room.Board + Top10perc + Top25perc + Outstate +
## F.Undergrad + Enroll + Accept
#mtry
floor(sqrt(7))
## [1] 2
set.seed(1234)
rf_2 <- randomForest(reg_formula, data = train, mtry = 2, ntree = 500)
rf_2
##
## Call:
## randomForest(formula = reg_formula, data = train, mtry = 2, ntree = 500)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 2
##
## Mean of squared residuals: 137.571
## % Var explained: 95.34
Test the Model Random Forest (using K-fold Cross-Validation)
#Prediction: Random Forest
pred_rf <- predict(rf_2, test)
pred_rf <- ((lambda_1*pred_rf)+1)^(1/lambda_1)
head(pred_rf,5)
## 8 13 15 18 20
## 2333.8852 978.4337 436.9443 1098.3207 2588.3653
#Absolute error mean, median, sd, max, min-------
abs_err_rf <- abs(pred_rf - test$Apps)
mean(abs_err_rf)
## [1] 439.2642
median(abs_err_rf)
## [1] 176.5174
sd(abs_err_rf)
## [1] 693.1222
IQR(abs_err_rf)
## [1] 453.6405
range(abs_err_rf)
## [1] 1.055278 6566.613865
#Actual vs. Predicted
plot(test$Apps, pred_rf, xlab = "Actual", ylab = "Prediction")
abline(a = 0, b = 1, col = "red", lwd = 2)
set.seed(1234)
trimmedbagging_1 <- randomForest(transform_Apps ~ . - Apps, mtry = ncol(trimmed_train) - 2, ntree = 500, data = trimmed_train)
trimmedbagging_1
##
## Call:
## randomForest(formula = transform_Apps ~ . - Apps, data = trimmed_train, mtry = ncol(trimmed_train) - 2, ntree = 500)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 17
##
## Mean of squared residuals: 78.25712
## % Var explained: 94.45
Test the Model Bagging with Trimmed Train
#Prediction Bagging
pred_trimmedbagging <- predict(trimmedbagging_1, test)
pred_trimmedbagging <- ((lambda_1*pred_trimmedbagging)+1)^(1/lambda_1)
head(pred_trimmedbagging,5)
## 8 13 15 18 20
## 2317.5671 960.4946 372.0066 1145.9394 2582.6366
#Absolute error mean, median, sd, max, min-------
abs_err_trimmedbagging <- abs(pred_trimmedbagging - test$Apps)
mean(abs_err_trimmedbagging)
## [1] 735.7105
median(abs_err_trimmedbagging)
## [1] 147.8909
sd(abs_err_trimmedbagging)
## [1] 1634.94
IQR(abs_err_trimmedbagging)
## [1] 441.6711
range(abs_err_trimmedbagging)
## [1] 0.3129347 9394.1215224
#Actual vs. Predicted
plot(test$Apps, pred_trimmedbagging, xlab = "Actual", ylab = "Prediction")
abline(a = 0, b = 1, col = "red", lwd = 2)
library("gbm")
library("xgboost")
library("ggplot2")
set.seed(123)
#train GBM model
gbm_1 <- gbm(formula =transform_Apps ~ . - Apps,
distribution = "gaussian",
data = train,
n.trees = 10000, #the total number of trees to fit
interaction.depth = 1, #1: stump, the maximum depth of each tree
shrinkage = 0.001, #learning rate
cv.folds = 5, #Number of cross-validation folds to perform
n.cores = NULL, #will use all cores by default
verbose = FALSE)
#get MSE and compute RMSE
min(gbm_1$cv.error) #MSE
## [1] 158.3526
sqrt(min(gbm_1$cv.error)) #RMSE
## [1] 12.58382
#plot loss function as a result of n trees added to the ensemble
gbm.perf(gbm_1, method = "cv")
## [1] 9999
#returns the estimated optimal number of iterations
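Since `gbm.perf()` returns the estimated optimal number of iterations, that value can be passed straight to `predict()` rather than hard-coding a tree count. A self-contained sketch on simulated data (in the actual analysis, `gbm_1` and `test` would play these roles):

```r
library(gbm)
set.seed(1)
# Simulated stand-in data, purely to illustrate the pattern
d <- data.frame(x1 = rnorm(500), x2 = rnorm(500))
d$y <- 2 * d$x1 - d$x2 + rnorm(500)
fit <- gbm(y ~ x1 + x2, data = d, distribution = "gaussian",
           n.trees = 300, shrinkage = 0.05, cv.folds = 3, verbose = FALSE)
best_iter <- gbm.perf(fit, method = "cv", plot.it = FALSE)  # optimal n.trees
pred <- predict(fit, newdata = d, n.trees = best_iter)
```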
#Use different parameters
set.seed(123)
#Tuning
#Create hyper-parameter grid
par_grid <- expand.grid(shrinkage = c(0.001, 0.005, 0.01),
interaction_depth = c(1, 3, 5),
n_minobsinnode = c(5, 10, 15),
bag_fraction = c(0.5, 0.7, 0.9)
)
nrow(par_grid)
## [1] 81
#Grid search (train/validation approach)
for(i in 1:nrow(par_grid)) {
set.seed(123)
#train model
gbm_tune <- gbm(formula = transform_Apps ~ . - Apps,
distribution = "gaussian",
data = train,
n.trees = 5000,
interaction.depth = par_grid$interaction_depth[i],
shrinkage = par_grid$shrinkage[i],
n.minobsinnode = par_grid$n_minobsinnode[i],
bag.fraction = par_grid$bag_fraction[i],
train.fraction = 0.8,
cv.folds = 0,
n.cores = NULL, #will use all cores by default
verbose = FALSE)
#add min training error and trees to grid
par_grid$optimal_trees[i] <- which.min(gbm_tune$valid.error)
par_grid$min_RMSE[i] <- sqrt(min(gbm_tune$valid.error))
}
knitr::kable(head(par_grid,5))
| shrinkage | interaction_depth | n_minobsinnode | bag_fraction | optimal_trees | min_RMSE |
|---|---|---|---|---|---|
| 0.001 | 1 | 5 | 0.5 | 5000 | 11.621141 |
| 0.005 | 1 | 5 | 0.5 | 4729 | 9.039213 |
| 0.010 | 1 | 5 | 0.5 | 4790 | 8.983764 |
| 0.001 | 3 | 5 | 0.5 | 5000 | 8.700245 |
| 0.005 | 3 | 5 | 0.5 | 2748 | 8.304058 |
#Modify hyper-parameter grid
par_grid <- expand.grid(shrinkage = c(0.005, 0.007, 0.01),
interaction_depth = c(3, 4, 5),
n_minobsinnode = c(4, 5, 6),
bag_fraction = c(0.5, 0.5, 0.7) #stochastic gradient :bag.fraction < 1
)
#Grid search
for(i in 1:nrow(par_grid)) {
set.seed(123)
#train model
gbm_tune <- gbm(formula = transform_Apps ~ . - Apps,
distribution = "gaussian",
data = train,
n.trees = 5000,
interaction.depth = par_grid$interaction_depth[i],
shrinkage = par_grid$shrinkage[i],
n.minobsinnode = par_grid$n_minobsinnode[i],
bag.fraction = par_grid$bag_fraction[i],
train.fraction = 0.8,
cv.folds = 0,
n.cores = NULL, #will use all cores by default
verbose = FALSE)
#add min training error and trees to grid
par_grid$optimal_trees[i] <- which.min(gbm_tune$valid.error)
par_grid$min_RMSE[i] <- sqrt(min(gbm_tune$valid.error))
}
knitr::kable(head(par_grid,5))
| shrinkage | interaction_depth | n_minobsinnode | bag_fraction | optimal_trees | min_RMSE |
|---|---|---|---|---|---|
| 0.005 | 3 | 4 | 0.5 | 3262 | 8.284006 |
| 0.007 | 3 | 4 | 0.5 | 2621 | 8.366324 |
| 0.010 | 3 | 4 | 0.5 | 2071 | 8.304693 |
| 0.005 | 4 | 4 | 0.5 | 4358 | 8.315379 |
| 0.007 | 4 | 4 | 0.5 | 1573 | 8.379029 |
#Final Model
gbm_3 <- gbm(formula =transform_Apps ~ . - Apps,
distribution = "gaussian",
data = train,
n.trees = 2500,
interaction.depth = 5,
shrinkage =0.010,
n.minobsinnode = 5,
bag.fraction = 0.5,
train.fraction = 0.8,
cv.folds = 0,
n.cores = NULL, #will use all cores by default
verbose = FALSE)
summary(gbm_3)
## var rel.inf
## Accept Accept 85.27310170
## Top10perc Top10perc 2.43175401
## F.Undergrad F.Undergrad 2.34824064
## Top25perc Top25perc 2.07351122
## Enroll Enroll 1.68191087
## Outstate Outstate 1.02719088
## Grad.Rate Grad.Rate 0.89932108
## Room.Board Room.Board 0.71521635
## Expend Expend 0.67491768
## PhD PhD 0.59102565
## Books Books 0.46780972
## P.Undergrad P.Undergrad 0.44227437
## perc.alumni perc.alumni 0.43011173
## S.F.Ratio S.F.Ratio 0.36160514
## Personal Personal 0.30870081
## Terminal Terminal 0.23792353
## Private Private 0.03538463
Test the Model GB Regression
#Model: gbm_3
#Prediction
pred_gbm <- predict(gbm_3, n.trees = 2500, newdata = test)
head(pred_gbm,5)
## [1] 93.02486 58.26126 37.75944 62.04406 98.87365
pred_gbm <- ((lambda_1*pred_gbm)+1)^(1/lambda_1)
head(pred_gbm,5)
## [1] 2257.4308 907.8548 395.2033 1025.4105 2543.8731
#Absolute error mean, median, sd, max, min-------
abs_err_gbm <- abs(pred_gbm - test$Apps)
mean(abs_err_gbm)
## [1] 459.865
median(abs_err_gbm)
## [1] 150.3577
sd(abs_err_gbm)
## [1] 769.8837
IQR(abs_err_gbm)
## [1] 412.7232
range(abs_err_gbm)
## [1] 8.244238e-02 5.293883e+03
#Actual vs. Predicted
plot(test$Apps, pred_gbm, xlab = "Actual", ylab = "Prediction")
abline(a = 0, b = 1, col = "red", lwd = 2)
x <- model.matrix(transform_Apps ~ . - Apps, data = train)[, -1] #remove intercept
y <- train$transform_Apps
set.seed(123)
xgb_1 <- xgboost(data = x,
label = y,
eta = 0.1, #learning rate
lambda = 0, #regularization term
max_depth = 8, #tree depth
nrounds = 1000, #max number of boosting iterations
subsample = 0.65, #percent of training data to sample for each tree
objective = "reg:squarederror", #for regression models
verbose = 0 #silent
)
#train RMSE
xgb_1$evaluation_log
## iter train_rmse
## 1: 1 96.346611
## 2: 2 86.847481
## 3: 3 78.186195
## 4: 4 70.431076
## 5: 5 63.530521
## ---
## 996: 996 0.000425
## 997: 997 0.000425
## 998: 998 0.000425
## 999: 999 0.000425
## 1000: 1000 0.000425
#plot error vs number trees
ggplot(xgb_1$evaluation_log) +
geom_line(aes(iter, train_rmse), color = "red")
#Tuning(Train/validation using xgboost)
#Train and validation sets
set.seed(1234)
train_cases <- sample(1:nrow(train), nrow(train) * 0.8)
#Train data set
train_xgboost <- train[train_cases,]
dim(train_xgboost)
## [1] 434 19
#Model Matrix
xtrain <- model.matrix(transform_Apps ~ . - Apps, data = train_xgboost)[, -1] #remove intercept
ytrain <- train_xgboost$transform_Apps
#Validation data set
validation_xgboost <- train[- train_cases,]
dim(validation_xgboost)
## [1] 109 19
xvalidation <- model.matrix(transform_Apps ~ . - Apps, data = validation_xgboost)[, -1] #remove intercept
yvalidation <- validation_xgboost$transform_Apps
#Create hyper-parameter grid
par_grid <- expand.grid(eta = c(0.01, 0.05, 0.1, 0.3),
lambda = c(0, 1, 2, 5),
max_depth = c(1, 3, 5, 7),
subsample = c(0.65, 0.8, 1),
colsample_bytree = c(0.8, 0.9, 1))
dim(par_grid)
## [1] 576 5
#Grid search
for(i in 1:nrow(par_grid )) {
set.seed(123)
#train model
xgb_tune <- xgboost(data = xtrain,
label = ytrain,
eta = par_grid$eta[i],
lambda = par_grid$lambda[i],
max_depth = par_grid$max_depth[i],
subsample = par_grid$subsample[i],
colsample_bytree = par_grid$colsample_bytree[i],
nrounds = 1000,
objective = "reg:squarederror", #for regression models
verbose = 0, #silent,
early_stopping_rounds = 10 #stop if no improvement for 10 consecutive trees
)
#prediction on validation data set
pred_xgb_validation <- predict(xgb_tune, xvalidation)
rmse <- sqrt(mean((yvalidation - pred_xgb_validation) ^ 2))
#add validation error
par_grid$RMSE[i] <- rmse
}
knitr::kable(head(par_grid,5))
| eta | lambda | max_depth | subsample | colsample_bytree | RMSE |
|---|---|---|---|---|---|
| 0.01 | 0 | 1 | 0.65 | 0.8 | 12.27853 |
| 0.05 | 0 | 1 | 0.65 | 0.8 | 11.85737 |
| 0.10 | 0 | 1 | 0.65 | 0.8 | 12.39560 |
| 0.30 | 0 | 1 | 0.65 | 0.8 | 14.49645 |
| 0.01 | 1 | 1 | 0.65 | 0.8 | 12.27853 |
#Final Model
set.seed(123)
xgb_2 <- xgboost(data = x,
label = y,
eta = 0.05, #learning rate
max_depth = 3, #tree depth
lambda = 0,
nrounds = 1000,
colsample_bytree = 0.8,
subsample = 0.65, #percent of training data to sample for each tree
objective = "reg:squarederror", #for regression models
verbose = 0 #silent
)
Test the Model XGBoost Regression
#Model: xgb_2
x_test <- model.matrix(transform_Apps ~ . - Apps, data = test)[, -1]#remove intercept
pred_xgb <- predict(xgb_2, x_test)
head(pred_xgb,5)
## [1] 91.04086 59.17511 38.16672 65.59778 98.59235
pred_xgb <- ((lambda_1*pred_xgb)+1)^(1/lambda_1)
head(pred_xgb,5)
## [1] 2164.1505 935.5984 403.3414 1142.3649 2529.7054
#Absolute error mean, median, sd, max, min-------
abs_err_xgb <- abs(pred_xgb - test$Apps)
mean(abs_err_xgb)
## [1] 447.7819
median(abs_err_xgb)
## [1] 183.3197
sd(abs_err_xgb)
## [1] 699.316
IQR(abs_err_xgb)
## [1] 437.3617
range(abs_err_xgb)
## [1] 4.182853 5387.825432
df <- data.frame("Model_1" = abs_err_Tlm,
"Model_2" = abs_err_Tlm_trimmed,
"Model_3" = abs_err_lm_transform,
"Model_4" = abs_err_bestsub_adj2,
"Model_5" = abs_err_bestsub_CP,
"Model_6" = abs_err_bestsub_BIC,
"Model_7" = abs_err_trimmed_bestsub,
"Model_8" = abs_err_ridgereg,
"Model_9" = abs_err_lassoreg,
"Model_10" = abs_err_tree,
"Model_11" = abs_err_bagging,
"Model_12" = abs_err_rf,
"Model_13" = abs_err_trimmedbagging,
"Model_14" = abs_err_gbm,
"Model_15" = abs_err_xgb)
models_comp <- data.frame("Mean of AbsErrors" = apply(df, 2, mean),
"Median of AbsErrors" = apply(df, 2, median),
"SD of AbsErrors" = apply(df, 2, sd),
"IQR of AbsErrors" = apply(df, 2, IQR),
"Min of AbsErrors" = apply(df, 2, min),
"Max of AbsErrors" = apply(df, 2, max))
rownames(models_comp) <- c("Tlm",
"Tlm_trimmed",
"lm_transform",
"bestsub_adj2",
"bestsub_CP",
"bestsub_BIC",
"trimmed_bestsub",
"ridgereg",
"Lassoreg",
"Tree",
"bagging",
"RandomFarest",
"Trimmedbagging",
"Gbm",
"Xgb ")
models_comp <- rbind(models_comp, "lm_transform_trimmed_test" = c(mean(abs_err_lm_transform_trimmed_test),
median(abs_err_lm_transform_trimmed_test),
sd(abs_err_lm_transform_trimmed_test),
IQR(abs_err_lm_transform_trimmed_test),
range(abs_err_lm_transform_trimmed_test)))
models_comp<-models_comp[c(1,2,3,16,4,5,6,7,8,9,10,11,12,13,14,15),]
library(knitr)
kable(models_comp)
| Mean.of.AbsErrors | Median.of.AbsErrors | SD.of.AbsErrors | IQR.of.AbsErrors | Min.of.AbsErrors | Max.of.AbsErrors | |
|---|---|---|---|---|---|---|
| Tlm | 599.7741 | 375.7187 | 734.9941 | 567.6974 | 2.0195353 | 6893.422 |
| Tlm_trimmed | 511.9436 | 253.6845 | 910.2706 | 420.2313 | 4.9054292 | 9327.358 |
| lm_transform | 711.1906 | 370.7594 | 1116.9812 | 578.1689 | 0.5630465 | 7773.258 |
| lm_transform_trimmed_test | 498.4068 | 326.4810 | 567.9842 | 528.3034 | 0.5630465 | 4022.992 |
| bestsub_adj2 | 722.8981 | 378.1958 | 1173.3939 | 593.8777 | 2.9694363 | 7858.990 |
| bestsub_CP | 724.3431 | 363.9783 | 1174.1980 | 581.8620 | 0.9072678 | 7946.280 |
| bestsub_BIC | 711.1906 | 370.7594 | 1116.9812 | 578.1689 | 0.5630465 | 7773.258 |
| trimmed_bestsub | 1007.5954 | 251.5226 | 2460.7692 | 500.0060 | 0.6774924 | 19831.086 |
| ridgereg | 797.1100 | 396.5723 | 1315.3914 | 625.6818 | 1.6371001 | 8856.852 |
| Lassoreg | 720.7397 | 371.5482 | 1173.4429 | 581.7214 | 1.6167782 | 7866.568 |
| Tree | 511.7333 | 166.9600 | 977.2577 | 427.0593 | 0.9915969 | 10029.575 |
| bagging | 483.0405 | 133.4050 | 905.2469 | 483.2991 | 0.6534906 | 9092.746 |
| RandomForest | 439.2642 | 176.5174 | 693.1222 | 453.6405 | 1.0552783 | 6566.614 |
| Trimmedbagging | 735.7105 | 147.8909 | 1634.9395 | 441.6711 | 0.3129347 | 9394.122 |
| Gbm | 459.8650 | 150.3577 | 769.8837 | 412.7232 | 0.0824424 | 5293.883 |
| Xgb | 447.7819 | 183.3197 | 699.3160 | 437.3617 | 4.1828526 | 5387.825 |
Depending on the error summary used (mean, median, SD, or IQR of absolute errors), a different model comes out on top: the Random Forest model is the best in terms of the mean of absolute errors, while bagging performs better on the median of absolute errors.
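One way to make this comparison explicit is to ask, for each error summary, which row of the comparison table attains the minimum. A sketch on a small stand-in for the `models_comp` table built above (values taken from the table):

```r
# Sketch: name the best model under each error summary.
# 'comp' is a small stand-in for models_comp, using values from the table above.
comp <- data.frame(Mean.of.AbsErrors = c(599.7741, 483.0405, 439.2642),
                   Median.of.AbsErrors = c(375.7187, 133.4050, 176.5174),
                   row.names = c("Tlm", "bagging", "RandomForest"))
sapply(comp, function(col) rownames(comp)[which.min(col)])
# Mean.of.AbsErrors -> "RandomForest", Median.of.AbsErrors -> "bagging"
```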
This study suggests that dividing the universities into two or more categories and predicting Apps separately within each category could yield a more accurate model.
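A minimal sketch of that idea, splitting on a binary grouping variable and fitting one model per group (toy data and lm() used for brevity; in the actual analysis, Private could serve as the grouping variable):

```r
# Sketch: split the data on a category and fit a separate model per group.
set.seed(1)
d <- data.frame(Private = factor(sample(c("Yes", "No"), 200, replace = TRUE)),
                Accept = runif(200, 100, 5000))
d$Apps <- 1.2 * d$Accept + ifelse(d$Private == "Yes", 200, 800) + rnorm(200, sd = 50)
fits <- lapply(split(d, d$Private), function(g) lm(Apps ~ Accept, data = g))
sapply(fits, function(f) coef(f)[["Accept"]])  # slope estimate within each group
```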
As can be seen, the variables Accept, Enroll, and F.Undergrad are highly correlated, so it may be better to use principal component analysis (PCA), a powerful dimension-reduction method, to help the models perform better.
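A minimal PCA sketch with base R's prcomp(), using simulated stand-ins for the three correlated size variables:

```r
# Sketch: summarize three correlated predictors with principal components.
set.seed(1)
n <- 300
size <- rnorm(n)  # latent "institution size" driving all three variables
X <- data.frame(Accept = size + rnorm(n, sd = 0.2),
                Enroll = size + rnorm(n, sd = 0.2),
                F.Undergrad = size + rnorm(n, sd = 0.2))
pca <- prcomp(X, center = TRUE, scale. = TRUE)
summary(pca)$importance["Proportion of Variance", ]
# PC1 carries most of the variance and could replace all three columns
scores <- pca$x[, 1]
```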
Both of these approaches could be investigated in a future study.
Almost all of the linear regression models built on the training data violated the regression assumptions, including normality of the residuals. We therefore removed one observation that was also an outlier on several other variables, which resolved the leverage problem, and then transformed the dependent variable using the Box-Cox method. However, this did not resolve the violation of the residual-normality assumption, which makes predictions from such models difficult to generalize.
For this reason, we turned to other methods such as stepwise regression, random forests, and boosting, whose purpose is purely data-driven prediction and for which it is not necessary to check the residual-normality assumption.
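The residual-normality check referred to above can be scripted directly; a minimal sketch with base R's shapiro.test() on the residuals of a fitted lm (toy data standing in for the report's regression models):

```r
# Sketch: Shapiro-Wilk test of residual normality for a fitted model.
set.seed(1)
d <- data.frame(x = runif(100))
d$y <- 3 * d$x + rnorm(100)
fit <- lm(y ~ x, data = d)
shapiro.test(residuals(fit))  # a small p-value indicates non-normal residuals
```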
End of the code.